Conference Introduction


Marshall  J.  Farr,  Director 

Personnel  and  Training  Research  Programs 

Office  of  Naval  Research 


I  am  proud  to  introduce  these  Proceedings  of  the  1979  Computerized  Adaptive 
Testing  Conference.  This  was  the  third  conference  of  its  type  sponsored  by  the 
Office  of  Naval  Research  (ONR)  in  conjunction  with  various  co-sponsors,  which 
for  this  conference  included  the  Navy  Personnel  Research  and  Development  Center, 
the  Air  Force  Office  of  Scientific  Research,  the  Army  Research  Institute  for  the 
Behavioral  and  Social  Sciences,  the  Military  Enlistment  Processing  Command,  and 
the  Defense  Advanced  Research  Projects  Agency. 

The  growing  international  interest  in  computerized  adaptive  testing  was 
evidenced  by  the  fact  that  representatives  from  Australia,  Austria,  Belgium, 
Japan,  and  West  Germany  made  up  part  of  the  more  than  80  invited  participants  in 
this  conference.  Equally  impressive  was  the  widespread  representation  from  fed¬ 
eral  agencies:  In  addition  to  those  from  the  sponsors,  participants  came  from 
the  U.S.  Marine  Corps,  Air  Force  Human  Resource  Laboratories,  the  U.S.  Coast 
Guard,  the  Navy  Guided  Missile  School,  the  U.S.  Civil  Service  Commission,  and 
the  Naval  Aerospace  Medical  Research  Laboratory. 

Computerized  adaptive  testing  (CAT)  has  come  a  long  way  in  a  short  span  of 
years,  thanks  to  an  ever-burgeoning  interest  in  the  field,  which  continues  to  be 
spearheaded  by  the  ONR  contractors  represented  in  these  proceedings.  Since  the 
1977  CAT  conference,  the  Defense  Department  has  formally  recognized  its  promise. 
In  January  1979  a  memorandum  issued  at  the  level  of  the  Office  of  the  Secretary 
of  Defense  directed  "the  development  and  further  evaluation  of  the  feasibility 
of  implementing  computerized  adaptive  testing  in  the  Department  of  Defense."  The 
memorandum  went  on  to  call  for  a  Defense  Department-wide  research  and  develop¬ 
ment  program,  which  will  eventually  transform  the  Armed  Services  Vocational  Ap¬ 
titude  Battery  (ASVAB) — now  used  by  all  the  Services  for  enlisted  personnel  se¬ 
lection  and  classification — into  a  computerized  adaptive  examination.  That  im¬ 
plementation-feasibility  study  is  now  underway,  guided  by  a  steering  committee 
representing  the  Navy,  Army,  Air  Force,  Marine  Corps,  and  the  Military  Enlist¬ 
ment  Processing  Command  (MEPCOM). 

The  Office  of  Naval  Research  is  generally  acknowledged  as  one  of  the  para¬ 
mount  forces,  if  not  the  leader,  in  constructing  the  theoretical  research  foun¬ 
dations  that  make  CAT  possible.  The  proceedings  of  this  conference  demonstrate 
that  our  support  of  research  on  important  theoretical  questions  in  CAT  continues 
unabated.  I  believe  strongly  in  the  potential  of  CAT  for  possibly  revolutioniz¬ 
ing  test  administration  and  scoring  in  the  measurement  of  both  ability  and 
achievement.  I  further  believe  that  having  computers  readily  available  for  on¬ 
line  testing  will  encourage  the  development  of  new  kinds  of  test  items,  admin¬ 
istration,  and  scoring,  over  and  above  the  changes  to  be  wrought  by  CAT. 


Session  1*: 

Adaptive  Testing  Strategies  for  Measuring  Ability 

Adaptive  Verbal  Ability  Testing 
in  a  Military  Setting 

James  R.  McBride 
Navy  Personnel  Research  and 
Development  Center 

Parallel  Forms  Reliability  and 
Measurement  Accuracy  Comparison 
of  Adaptive  and  Conventional 
Testing  Strategies 

Marilyn  F.  Johnson  and 
David  J.  Weiss 
University  of  Minnesota 

A  Comparison  of  the  Accuracy  of 
Bayesian  Adaptive  and  Static 
Tests  Using  a  Correction  for 
Regression 

Steven  Gorman 
Department  of  the  Navy 

Discussion 

Brian  Waters 
Air  University 


*A  paper  entitled  "Criterion-Related  Validity  of  Conventional  and  Adaptive 
Ability  Tests  in  a  Military  Environment,"  by  James  B.  Sympson,  was  also  pre¬ 
sented  in  Session  1,  but  was  not  available  for  inclusion  in  these  Proceedings. 


Adaptive  Verbal  Ability  Testing  in  a  Military  Setting 


James R. McBride

Navy  Personnel  Research  and  Development  Center 


Since  January  1976  all  military  services  have  used  a  common  battery  of  men¬ 
tal  tests  for  enlisted  personnel  selection  and  classification:  the  Armed  Ser¬ 
vices  Vocational  Aptitude  Battery  (ASVAB).  The  battery  includes  12  subtests  of 
cognitive  aptitudes.  These  subtests  are  necessarily  short;  they  are  usually 
scored  by  hand;  the  raw  scores  are  manually  converted  into  service-specific 
scaled  scores  using  conversion  tables;  and  the  scale  scores  are  manually  recor¬ 
ded  and  manually  transcribed  into  permanent  individual  personnel  records. 

The  U.S.  Marine  Corps  has  identified  some  difficulties  with  the  ASVAB  test¬ 
ing  program.  Now  that  the  ASVAB  has  supplanted  service-specific  classification 
test  batteries,  a  single  test  battery  must  serve  all  the  special  testing  needs 
of  the  four  services.  In  many  cases,  ASVAB  subtests  are  excessively  difficult 
for  Marine  Corps  selection  and  classification  purposes;  this  can  result  in  inef¬ 
ficient  and  inaccurate  classification.  There  has  been  some  compromise  of  ASVAB 
test  security:  Test  booklets  and  answer  keys  have  been  stolen.  This  problem,  if 
uncontrolled,  could  seriously  degrade  the  validity  of  the  tests  for  classifica¬ 
tion  purposes.  The  manual  nature  of  the  test  scoring,  score  conversion,  and 
score  recording  procedures  provides  opportunity  for  clerical  error,  and  it  is 
believed  that  such  errors  may  have  resulted  in  numerous  accession  errors. 

The  Marine  Corps  formulated  an  operational  requirement  to  lessen  or  elimi¬ 
nate  the  impact  of  the  problems  discussed  above.  Computer-administered  adaptive 
testing  (CAT)  was  identified  as  one  potential  solution  to  all  of  these  problems. 
In  an  adaptive  test,  test  difficulty  is  tailored  dynamically  to  the  ability  lev¬ 
el  of  the  individual  examinee;  in  principle,  then,  CAT  eliminates  the  problem  of 
excessive  test  difficulty  and  should  yield  scores  that  promote  accurate  selec¬ 
tion  and  classification  decisions.  CAT  addresses  the  test  security  problem  by 
eliminating  printed  booklets  and  scoring  keys  and  by  administering  an  individu¬ 
ally  tailored  set  of  test  items  to  each  examinee.  Additionally,  since  CAT  auto¬ 
mates  test  administration,  test  scoring  and  recording  are  automated  as  well, 
thereby  eliminating  human  clerical  error  from  the  testing  system. 

Recognizing  the  potential  of  CAT  for  selection  and  classification  testing, 
the  Marine  Corps  tasked  NPRDC  with  investigating  the  feasibility  of  CAT  as  part 
of  a  program  of  phased  research  and  development  related  to  military  personnel 
accessioning.


The  research  reported  here  was  intended  to  assess  the  feasibility  of  using 
computerized  adaptive  testing  (CAT)  in  a  Marine  Corps  recruit/applicant  popula¬ 
tion  and,  at  the  same  time,  to  verify  the  claimed  merits  of  CAT  as  a  psychologi¬ 
cal  measurement  technique.  These  two  research  issues  could  only  be  addressed  by 
administering  adaptive  tests  to  appropriate  examinee  samples.  The  capability  to 
do  this  had  to  be  developed — equipment  identified,  software  written,  and  large 
banks  of  test  items  assembled  and  calibrated  using  item  characteristic  curve 
(ICC)  models.  After  this  development  was  completed,  a  pilot  study  involving 
verbal  ability  tests  was  conducted.  This  report  describes  the  pilot  study  of 
the  feasibility  and  psychometric  merits  of  an  adaptive  procedure  for  measuring 
verbal  ability. 

Background 

Group-administered  paper-and-pencil  "objective"  ability  tests  date  back  to 
World  War  I,  when  the  introduction  of  the  Army  Alpha  test  signalled  an  era  of 
vast  improvements  in  the  administrative  efficiency  of  psychological  testing. 

The  price  paid  for  this  efficiency  was  loss  of  flexibility,  since  all  examinees 
had  to  answer  a  common  set  of  test  questions.  The  psychometric  effect  of  this 
was  not  too  serious,  provided  that  a  test  was  designed  to  have  a  difficulty  lev¬ 
el  appropriate  to  its  intended  application  or  that  a  test  was  sufficiently  long 
to  overcome  minor  design  deficiencies.  For  persons  whose  ability  level  was  not 
near  the  target  difficulty  level  of  the  test,  however,  the  paper-and-pencil  test 
was  not  a  particularly  accurate  or  precise  measuring  instrument. 

The  psychological  tests  used  by  the  armed  services  for  selection  and  clas¬ 
sification  are  group-administered  paper-and-pencil  tests.  Such  tests,  as  just 
discussed,  lack  the  flexibility  to  measure  well  over  a  wide  range  of  ability. 

In  order  to  achieve  that  flexibility,  the  difficulty  level  of  the  test  would 
have  to  be  chosen  to  fit  individual  ability  levels.  Since  individual  ability 
levels  are  not  known  prior  to  testing,  this  is  not  practical;  however,  it  can  be 
accomplished  using  an  adaptive  test  in  which  test  items  are  chosen  sequentially 
on  the  basis  of  the  examinee's  performance.  This  sequential  item  choice  can 
best  be  accomplished  using  automated  test  administration,  for  example,  by  having 
the  test  administered  at  an  interactive  computer  terminal. 


The  historical  development  of  computer-administered  adaptive  testing  was 
reviewed  by  Weiss  and  Betz  (1973)  and  by  Wood  (1973).  Weiss  surveyed  a  variety 
of  alternative  adaptive  testing  methods  (1974)  and  summarized  a  number  of  poten¬ 
tial  advantages  of  CAT  over  conventional  paper-and-pencil  tests  (1975).  Despite 
those  potential  advantages,  most  research  into  adaptive  testing  had  been  at  the 
basic  research  level,  until  1975  when  the  U.S.  Civil  Service  Commission  began 
moving  toward  early  1980s  implementation  of  computer-based  adaptive  administra¬ 
tion  of  its  PACE  examination  (Gorham,  1975). 

The U.S. Civil Service Commission's implementation plans were based on research
conducted by Urry and his colleagues (e.g., Urry, 1977). Urry chose to adopt a
Bayesian sequential adaptive testing procedure proposed by Owen (1969, 1975) and
demonstrated that the procedure could achieve satisfactory levels of measurement
reliability in substantially less than half the number of items required of a
conventional test; in one instance he estimated that an adaptive test was
equivalent in reliability to a conventional test five times as long (Urry,
1977). It is this efficiency of measurement which has motivated most
psychometric interest in adaptive testing, although test users have often been
more attracted by its practical advantages, which were discussed above.

Marine  Corps  interest  in  CAT  for  personnel  selection  and  classification 
testing  resulted  from  dissatisfaction  with  certain  aspects  of  the  joint  service 
paper-and-pencil testing battery. Subtests used for selection decisions were
also  used  as  a  basis  for  personnel  classification  and  assignment  to  specialized 
training;  a  test  designed  for  one  of  these  purposes  would  likely  be  inappropri¬ 
ate  for  the  other,  and  this  might  result  in  disproportionate  numbers  of  selec¬ 
tion  or  assignment  errors.  Clerical  errors  in  the  manual  scoring  and  score  re¬ 
cording  processes  were  felt  to  be  another  serious  source  of  accessioning  errors; 
and  the  effects  of  test  compromise  were  inevitable  with  the  use  of  the  same  test 
battery  over  a  period  of  several  years. 

Recognizing  that  computerized  test  administration  could  eliminate  scoring 
and  clerical  errors  and  that  adaptive  testing  could  substantially  reduce  test 
compromise,  Marine  Corps  Headquarters  tasked  NPRDC  with  evaluating  the  feasibil¬ 
ity  of  CAT  for  testing  Marine  recruits.  The  purpose  of  this  paper  is  to  report 
the results of the first in a series of studies investigating both the
feasibility and the utility of CAT in comparison with a conventional test design.

The  study  was  designed  in  part  to  address  three  research  questions:  (1)  Is 
computer-based  testing  of  military  recruits  administratively  feasible?  (2)  Is  a 
computer-administered  adaptive  test  more  reliable  than  a  conventional  test, 
holding  test  length  constant?  (3)  If  so,  what  is  an  appropriate  length  criterion 
for  an  adaptive  test? 

These  questions  were  motivated  by  the  results  of  previous  research  done 
elsewhere.  The  first  question — that  of  administrative  feasibility — seems  trivi¬ 
al  but  is  not.  Interviews  with  military  testing  personnel  indicated  some  mis¬ 
givings  about  the  ability  of  military  recruits  to  use  relatively  sophisticated 
automated  testing  equipment,  such  as  CRT  computer  terminals.  This  potential 
man-machine  interface  problem  is  the  analogue  of  administrative  difficulties 
encountered years earlier with paper-and-pencil tailored tests. For example,
Seeley, Morton, and Anderson (1962) found that a substantial proportion of their
military examinees did not successfully follow instructions on an experimental
sequential item test; this experience may have caused a five-year lapse in
military research on tailored or adaptive testing. Olivier (1974) had a similar
experience using a paper-and-pencil flexilevel test in a sample of high school
students.

The  question  of  the  advantages  of  adaptive  tests  over  conventional  ones  in 
terms  of  reliability  has  a  clear  and  positive  theoretical  answer:  Holding  test 
length  and  all  else  constant,  a  good  tailored  test  design  is  superior,  provided 
that  highly  discriminating  test  items  are  available  (Urry,  1970). 

This theoretical advantage is not always corroborated in empirical
investigations. For instance, Bryson (1971) questioned the advantage of tailored
testing over certain methods of conventional test design; Olivier (1974) failed
to find an advantage for the flexilevel tests he used; and the results reported
by Weiss and his colleagues have been less than unanimous in favor of adaptive
tests. All these results are in contrast with those of Urry (1977), who reported
that for his sample of 57 Civil Service job applicants an adaptive verbal
ability test achieved an 80% reduction (compared to a conventional test) in the
test length required to attain any of several specified levels of reliability.
Urry's result was extraordinary. The only cloud over it is that it was based on
indirect evidence: The conventional test reliabilities were based on
Spearman-Brown equation adjustments to the reliability obtained in an
independent sample, and the tailored test reliability was merely assumed, not
rigorously verified.

Previous research into the reliability, validity, and efficiency of adaptive
tests has often been inconclusive because of design flaws or nuisance factors.
The major problem has been the lack of suitable means for estimating the
adaptive  test's  reliability  without  making  dubious  assumptions.  Another  problem 
has  been  the  general  failure  to  match  adaptive  and  counterpart  conventional 
tests  in  item  quality,  with  an  unfair  advantage  usually  in  favor  of  the  adaptive 
test.  The  research  reported  here  was  intentionally  designed  to  remove  those  two 
problems — to  provide  credible  indices  of  reliability  that  are  appropriate  for 
both  test  types  and  to  provide  a  fair  comparison  by  matching  item  quality  across 
the  test  types.  With  those  two  problem  sources  eliminated,  there  is  hope  for  an 
unequivocal  comparison  between  adaptive  and  conventional  test  designs. 

Method 

The  general  method  used  was  that  of  equivalent  tests  administered  to  inde¬ 
pendent  examinee  groups.  One  group  took  two  equivalent  computer-administered 
adaptive  tests.  The  other  group  took  two  equivalent  conventional  tests,  also 
administered  by  computer.  In  order  to  control  for  item  quality,  both  test  types 
were  made  up  of  items  from  the  same  source — a  common  pool  of  150  verbal  ability 
items,  which  had  previously  been  calibrated  in  large  samples  of  Marine  recruits, 
using  ICC  methods. 

Research  Design 

Each  examinee  was  randomly  assigned  to  one  of  the  two  treatment  groups — 
Group A or C. Group A took two 30-item adaptive verbal ability tests, followed
by  a  50-item  criterion  test  of  word  knowledge.  Group  C  took  two  30-item  conven¬ 
tional  verbal  ability  tests,  followed  by  the  same  criterion  test.  All  tests 
were  administered  at  a  computer  terminal.  Figure  1  is  a  schematic  representa¬ 
tion  of  the  research  design. 


Figure 1
The Research Design for Administration
of the Experimental and Criterion Tests

                                     Tests
                      Adaptive            Conventional
Treatment Group    Form 1   Form 2     Form 1   Form 2     Criterion
      A              X        X                                X
      C                                  X        X            X

Observations. For each examinee who completed the tests, the following
data were observed and automatically recorded:

1. Elapsed time for the testing session;

2. Elapsed time to complete pretest instructions;

3. Number of errors made during the instructions;

4. Number of times the proctor was called;

5. Raw item scores (correct/incorrect);

6. Cumulative raw score after each item;

7. Latent trait ability estimates (experimental tests only);

8. Bayes posterior variance of the ability estimate after each item; and

9. Criterion test raw score.

The  format  for  these  observations  is  schematized  in  Figure  2. 


Figure 2
Example of Examinee Record (Abbreviated)

            Raw Score        Ability Estimate    Posterior Variance
Stage    Form 1   Form 2     Form 1   Form 2      Form 1   Form 2
   1        0        0        -.69     -.73        .548     .533
   2        1        1        -.36     -.37        .401     .394
   3        2        2        -.10     -.20        .332     .318
   4        2        3        -.30      .02        .248     .266
   5        3        4        -.14      .25        .229     .213
   6        4        5         .01      .48        .193     .210
   7        4        6        -.17      .65        .160     .184
   8        5        6        -.05      .45        .145     .143
   9        5        6        -.22      .26        .124     .115
  10        6        7        -.15      .33        .115     .107
 ...
  30       20       21         .59      .97        .053     .048

Criterion score       27
Total time            57.3 minutes
Instruction time      8.5 minutes
Instruction errors    1
Proctor calls         0

Independent variables. For the comparisons between the adaptive and
conventional testing methods there were two independent variables: (1) test type
(adaptive versus conventional) and (2) test length (5, 10, 15, 20, 25, 30
items).

Within  the  adaptive  testing  method,  the  test  termination  rule  was  treated  as 
an  independent  variable  for  some  analyses:  Tests  were  terminated  (1)  at  a  fixed 
test  length  (5,  10,  ...,  30  items)  or  (2)  at  a  specified  posterior  variance 
(variable length). The number-of-items termination rule resulted, of course, in
a  test  of  predetermined  length;  and  the  posterior  variance  rule  resulted  in  a 
variable  length  test,  depending  on  the  number  of  items  required  to  attain  speci¬ 
fied  levels  of  the  Bayes  posterior  variance. 
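
The two termination rules can be sketched as follows, using the Form 1 posterior variances from the abbreviated examinee record in Figure 2. The function name and the variance targets are illustrative assumptions; the variance targets actually used in the study are not stated here.

```python
# Posterior variances after stages 1-10 of Form 1, copied from Figure 2.
FORM1_POSTERIOR_VARIANCE = [.548, .401, .332, .248, .229,
                            .193, .160, .145, .124, .115]

def stopping_stage(variances, max_items=None, target_variance=None):
    """Return the 1-based stage at which testing stops, or None if it never does.

    max_items       -- fixed-length rule: stop after this many items.
    target_variance -- variable-length rule: stop once the posterior variance
                       is at or below this value (hypothetical targets).
    """
    for stage, variance in enumerate(variances, start=1):
        if max_items is not None and stage >= max_items:
            return stage
        if target_variance is not None and variance <= target_variance:
            return stage
    return None

print(stopping_stage(FORM1_POSTERIOR_VARIANCE, max_items=5))
print(stopping_stage(FORM1_POSTERIOR_VARIANCE, target_variance=0.20))
```

Under the fixed-length rule every examinee stops at the same stage; under the variance rule the stopping stage depends on how quickly that examinee's posterior variance shrinks.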

Dependent  variables.  Measures  of  the  dependent  variables  were  formed  from 
the  individual  observations.  The  dependent  variables  included: 

1.  Testing  time; 

2.  Instruction  time; 

3.  Number  of  keyboard  errors; 

4.  Number  of  proctor  calls; 

5.  Alternate  tests  reliability  coefficient  after  5,  10,  ...,  30  items;  and 

6. Test-criterion correlation after 5, 10, ..., 30 items.

Procedure 


Items. The 150 items in the pool were calibrated using Urry's ancillary
estimation method and were selected according to the prescriptions given by Urry
(1977): All ICC slope parameters exceeded .80. The average value of the
discrimination (a) parameter was 1.24; item difficulty (location, or b)
parameters ranged from -2.0 to +2.0; and there were no items with a
pseudo-guessing (c) parameter greater than .30.
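
As a rough illustration of what the three item parameters mean, the sketch below evaluates a three-parameter logistic ICC and applies the screening limits quoted above. The logistic form with scaling constant D = 1.7 is an assumption made for this sketch (the calibration may have used a normal-ogive model), and the example item is hypothetical apart from its discrimination, which is set to the pool average of 1.24.

```python
import math

def icc_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response at ability theta.

    a = discrimination (slope), b = difficulty (location),
    c = pseudo-guessing (lower asymptote). D = 1.7 is the usual constant
    that makes the logistic curve approximate the normal ogive (assumed).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def passes_screen(a, b, c):
    """Limits quoted in the text: a above .80, b within [-2, 2], c at most .30."""
    return a > 0.80 and -2.0 <= b <= 2.0 and c <= 0.30

# Hypothetical item: pool-average discrimination, middle difficulty.
item = {"a": 1.24, "b": 0.0, "c": 0.20}
print(passes_screen(**item))
# At theta equal to b the curve sits midway between c and 1.
print(round(icc_3pl(0.0, **item), 2))
```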

Examinees. Male Marine recruits reporting for duty at the Marine Corps
Recruit Depot, San Diego, were the examinees. They were tested one at a time at
a Burroughs TD832 terminal controlled by a Burroughs B1717 time-sharing
minicomputer system. Assignment to groups (Group A or C) was randomized. Two
hundred one examinees completed the tests: 96 of these took the adaptive tests
and 105 took conventional tests.

Tests. The conventional tests administered to Group C were rectangular
tests  spanning  the  difficulty  range  of  the  item  pool.  This  broad  range  of  dif¬ 
ficulty  was  chosen  in  order  to  simulate  the  psychometric  design  of  the  verbal 
tests  used  in  the  ASVAB.  Two  30-item  equivalent  forms — Form  1  and  Form  2 — were 
constructed  from  the  150-item  pool.  Items  were  chosen  to  be  as  highly  discrimi¬ 
nating  as  possible,  consistent  with  the  broad  difficulty  range.  The  two  forms 
were  constructed  to  be  "weakly  parallel"  (Samejima,  1977),  i.e.,  to  have  approx¬ 
imately  equal  test  information  functions.  Within  each  form,  the  30  items  were 
sorted  into  five  difficulty  levels,  then  arranged  in  descending  order  of  dis¬ 
criminating  power  within  each  level.  The  first  five  items  in  each  form  were  the 
most discriminating items at their respective difficulty levels; items 6 through
10 were the second most discriminating items at each level; and so on. This
arrangement  resulted  in  two  30-item  tests  consisting  of  a  sequence  of  six  5-item 
subsets  each.  This  design  was  intended  to  permit  meaningful  analysis  of  the 
psychometric  properties  of  rectangular  conventional  tests  of  lengths  of  5,  10, 
15,  20,  25,  and  30  items.  In  order  to  equalize  any  effects  due  to  test  length, 
fatigue,  or  other  extraneous  factors,  the  two  conventional  tests  were  adminis¬ 
tered  in  counterbalanced  item  order,  i.e.,  the  two  30-item  tests  were  adminis¬ 
tered  as  one  60-item  test  in  the  following  order: 

Item sequence:  1  2  3  4  5  6  7  8  ...

Test Form:      1  2  2  1  2  1  1  2  ...

The  two  30-item  adaptive  tests  were  based  on  Owen's  (1969,  1975)  Bayesian 
sequential  tailored  testing  procedure.  For  each  examinee  and  each  test  form  an 
initial  normal  prior  distribution  of  ability  was  assumed,  with  mean  0  and  vari¬ 
ance  1.0.  The  test  form  (either  1  or  2)  was  counterbalanced  for  each  examinee 
in a manner identical to that of the conventional tests: 1 2 2 1 2 1 1 2 .... Both
forms  of  the  Bayesian  test — Form  1  and  Form  2 — drew  items  from  the  same  150-item 
pool;  counterbalancing  the  order  of  administration  here  served  the  added  purpose 
of  equalizing  item  quality  across  the  two  forms.  The  two  adaptive  tests  were 
independent  of  each  other  except  for  their  use  of  a  common  item  pool. 
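
The general shape of a Bayesian sequential test can be sketched numerically. The code below is not Owen's procedure: it replaces his closed-form normal-approximation updates with a posterior computed on a grid, and it selects the unused item whose difficulty is nearest the current estimate as a simplified stand-in for his item-selection rule. The logistic ICC, the grid, the item pool, and the deterministic examinee are all hypothetical; only the normal prior with mean 0 and variance 1.0, the 150-item pool size, and the 30-item length come from the text.

```python
import math

def icc_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC (assumed form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

GRID = [-4.0 + 0.1 * k for k in range(81)]   # ability grid from -4 to +4

def normal_prior(mean=0.0, var=1.0):
    """Discretized normal prior; the text specifies mean 0 and variance 1.0."""
    weights = [math.exp(-(t - mean) ** 2 / (2.0 * var)) for t in GRID]
    total = sum(weights)
    return [w / total for w in weights]

def moments(posterior):
    """Posterior mean (the ability estimate) and posterior variance."""
    mean = sum(t * p for t, p in zip(GRID, posterior))
    var = sum((t - mean) ** 2 * p for t, p in zip(GRID, posterior))
    return mean, var

def update(posterior, item, correct):
    """Bayes update of the posterior after one scored response."""
    a, b, c = item
    likelihood = [icc_3pl(t, a, b, c) if correct else 1.0 - icc_3pl(t, a, b, c)
                  for t in GRID]
    unnormalized = [p * l for p, l in zip(posterior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def adaptive_test(pool, answer, n_items=30):
    """Administer n_items adaptively; return (estimate, variance) per stage."""
    posterior = normal_prior()
    unused = list(pool)
    history = []
    for _ in range(n_items):
        estimate = moments(posterior)[0]
        # Simplified selection: unused item with difficulty nearest the estimate.
        item = min(unused, key=lambda it: abs(it[1] - estimate))
        unused.remove(item)
        posterior = update(posterior, item, answer(item))
        history.append(moments(posterior))
    return history

# Hypothetical 150-item pool: a = 1.24, b evenly spaced on [-2, 2], c = .20.
pool = [(1.24, -2.0 + 4.0 * k / 149, 0.20) for k in range(150)]
# Hypothetical deterministic examinee: correct whenever difficulty is below 0.5.
history = adaptive_test(pool, answer=lambda item: item[1] < 0.5)
for stage in (1, 5, 10, 30):
    estimate, variance = history[stage - 1]
    print(stage, round(estimate, 2), round(variance, 3))
```

The per-stage pairs returned by `adaptive_test` correspond to the "Ability Estimate" and "Posterior Variance" columns of the examinee record in Figure 2.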

The  criterion  test  was  formed  by  concatenating  two  obsolete  operational 
test  forms  measuring  word  knowledge.  This  resulted  in  a  50-item  test  expected 
to  be  a  highly  reliable  and  fairly  broad-range  test  of  an  important  facet  of 
verbal  ability. 


Results  and  Discussion 


Feasibility 

Data pertaining to the feasibility of using computer terminals to administer
tests to military recruits are summarized in Table 1. Mean testing time
was  61.0  minutes  for  the  adaptive  test  group  versus  50.4  minutes  for  the  conven¬ 
tional  test  group.  These  were  the  mean  times  to  answer  110  items — 60  items  from 
either  the  adaptive  or  the  conventional  alternate  forms,  followed  by  50  criteri¬ 
on  test  items  common  to  both  groups.  The  adaptive  tests  required  about  11  more 
seconds  per  item,  or  as  much  as  39%  longer  to  answer  than  the  conventional 
tests.  Some  or  all  of  this  difference  may  have  been  due  to  computations  re¬ 
quired  for  adaptive  item  selection,  but  this  result  does  agree  generally  with 
Waters'  (1977)  finding  that  an  adaptive  test  required  significantly  longer  exam¬ 
inee  processing  per  item  than  a  similarly  administered  conventional  test.  In 
the  present  study,  however,  the  observed  time  difference  may  be  due  in  large 
part  to  idiosyncrasies  of  the  computer  system;  if  so,  differences  of  the  size 
reported  here  would  not  be  expected  if  a  faster  computer  were  used  to  control 
and  to  administer  the  adaptive  tests. 
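
The per-item figure can be checked from the two mean testing times. The sketch below assumes that the whole time difference arose in the 60 experimental items, since the 50-item criterion test was common to both groups; that attribution, and the function name, are assumptions made here. The 39% figure is not reproduced because it depends on the time spent on the criterion test, which is not reported separately.

```python
def extra_seconds_per_item(adaptive_minutes, conventional_minutes, n_items):
    """Extra testing time per item, in seconds, for the adaptive group."""
    return (adaptive_minutes - conventional_minutes) * 60.0 / n_items

# Mean testing times of 61.0 and 50.4 minutes; 60 experimental items.
print(round(extra_seconds_per_item(61.0, 50.4, 60), 1))
```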

Table 1
Testing Time and Examinee Error Summary for
Computer-Administered Test Sessions

                              Group and Test
Data                     A Adaptive   C Conventional   Overall
Number of examinees          96            105            201
Mean time (minutes)
  Total                     70.5           60.7
  Instructions               9.5           10.3            9.9
  Testing                   61.0           50.4
Errors
  Procedural errors          25             30             55
  Proctor calls               5             12             17

Note. Each session consisted of programmed instruction, 60 experimental test
items, and a 50-item criterion test.

Instruction time averaged 9.5 minutes for the adaptive test group and 10.3
minutes for the conventional group; overall, the instructions required an
average of 9.9 minutes. During this time, the examinees were familiarized with
the CRT and keyboard by means of a programmed instructional sequence with
special branching following procedural errors and with an audible call to the
proctor if the examinee had difficulty correcting an error. Errors and proctor
calls were counted. As Table 1 indicates, there were 55 errors in all, in 201
test sessions; in only 17 cases was the proctor called. This amounts to about
one procedural error per 4 test sessions and to a requirement for proctor
intervention about one time per 12 test sessions.
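
The session rates quoted above follow directly from the counts in Table 1. A minimal check, with a hypothetical helper name:

```python
def sessions_per_event(n_sessions, n_events):
    """Average number of test sessions per recorded event."""
    return n_sessions / n_events

# Counts from Table 1: 201 sessions, 55 procedural errors, 17 proctor calls.
print(round(sessions_per_event(201, 55), 1))   # procedural errors
print(round(sessions_per_event(201, 17), 1))   # proctor calls
```

Both values round to the "one per 4" and "one per 12" figures given in the text.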

Psychometric  Characteristics 


Reliability.  Table  2  summarizes  reliability  and  criterion  validity  data 
for  both  the  adaptive  and  conventional  alternate  forms  tests  at  lengths  of  5, 
10,  15,  20,  25,  and  30  items. 


Table 2
Psychometric Characteristics of the Computer-Administered
Verbal Ability Tests as a Function of Test Type and Test Length

Psychometric                              Test Length
Characteristic
and Test                N       5      10      15      20      25      30
Reliability
  Adaptive              96     .79     .87     .88     .90     .91     .91
  Conventional         105     .59     .73     .80     .83     .86     .89
Validity
  Adaptive              93     .77     .82     .83     .84     .85     .85
  Conventional         103     .73     .81     .84     .85     .85     .87
Relative efficiency           2.70    2.50    1.90    1.80    1.70    1.30

Reliability was operationalized as the correlation between scores on alternate
forms at a given test length. The scoring procedure used was the same for
both test types: latent ability estimation using the sequential estimation
formulae developed by Owen (1969). From the table it is clear that the adaptive
tests  had  substantially  higher  reliability  coefficients  than  the  conventional 
tests  for  any  given  test  length.  Viewing  these  data  another  way,  it  can  be  seen 
that  the  adaptive  test  reliability  at  a  5-item  test  length  was  practically 
equivalent  to  the  conventional  test's  reliability  at  15  items;  similarly,  the 
adaptive  test's  reliability  at  a  length  of  10  was  superior  to  that  of  the  con¬ 
ventional  test  at  a  length  of  25. 
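
Alternate forms reliability, as defined above, is a product-moment correlation computed across examinees between their Form 1 and Form 2 scores at a given test length. The sketch below shows that computation; the five pairs of ability estimates are hypothetical values invented for illustration and are not data from the study.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ability estimates for five examinees at one test length.
form1 = [-0.8, -0.2, 0.1, 0.6, 1.1]
form2 = [-0.6, -0.3, 0.3, 0.4, 1.2]
print(round(pearson_r(form1, form2), 2))
```

Repeating the calculation with the estimates recorded after 5, 10, ..., 30 items would yield one reliability coefficient per test length, as in Table 2.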

Figure  3  contains  a  graphic  comparison  of  the  adaptive  and  conventional 
tests  in  terms  of  alternate  forms  reliability  as  a  function  of  test  length. 
Analysis  of  Table  2  and  Figure  3  indicates  that  in  terms  of  test  length  required 
to  attain  a  given  level  of  reliability,  the  adaptive  tests  had  a  substantial 
advantage  over  the  conventional  tests.  This  advantage  was  essentially  the  same 
for  both  fixed  length  and  variable  length  stopping  rules;  there  was  no  apparent 
advantage  to  variable  length,  as  opposed  to  fixed  length,  within  the  adaptive 
testing  method. 

Figure  3 

Alternate  Forms  Reliability  Plotted  as  a  Function  of 
Test  Length  for  the  Conventional  and  Adaptive  Tests 


Relative efficiency. Thus, the adaptive tests achieved specific levels of
reliability more efficiently than the conventional tests. How much more
efficiently is indicated in the row of Table 2 labeled "relative efficiency." These
data,  based  on  the  Spearman-Brown  equation,  estimate  for  each  test  length  how 
much  the  conventional  tests  would  have  to  be  lengthened  to  attain  the  reliabil¬ 
ity  of  the  adaptive  tests.  For  example,  the  adaptive  test  reliability  at  5 
items,  .79,  was  estimated  to  be  equivalent  to  that  of  a  conventional  test  2.70 
times  as  long,  or  13.5  items  in  length.  Notice  that  the  relative  efficiency  of 
these  adaptive  tests  always  exceeds  unity  but  diminishes  as  test  length  in¬ 
creases.  Thus,  the  adaptive  tests  are  more  advantageous,  at  least  in  terms  of 
relative  efficiency,  at  fairly  short  test  lengths.  At  lengths  of  10  or  fewer 
items,  these  adaptive  tests  were  at  least  2.5  times  as  efficient  as  the  conven¬ 
tional  tests.  At  lengths  of  15  and  more,  however,  the  advantage,  although  still 
appreciable,  is  not  quite  so  striking. 
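
The lengthening factor can be recovered by solving the Spearman-Brown equation for the multiple k that raises the conventional reliability to the adaptive value. The sketch below uses the rounded coefficients printed in Table 2, so its results differ slightly from the published relative efficiencies (about 2.6 rather than 2.70 at 5 items); that gap presumably reflects rounding of the printed coefficients, which is an assumption. The function names are hypothetical.

```python
def spearman_brown(r, k):
    """Reliability of a test lengthened by a factor k."""
    return k * r / (1.0 + (k - 1.0) * r)

def lengthening_factor(r_target, r_current):
    """Factor k for which spearman_brown(r_current, k) equals r_target."""
    return r_target * (1.0 - r_current) / (r_current * (1.0 - r_target))

# Rounded reliabilities from Table 2: (test length, adaptive, conventional).
TABLE2_RELIABILITY = [(5, .79, .59), (10, .87, .73), (15, .88, .80),
                      (20, .90, .83), (25, .91, .86), (30, .91, .89)]

for length, r_adaptive, r_conventional in TABLE2_RELIABILITY:
    k = lengthening_factor(r_adaptive, r_conventional)
    print(length, round(k, 2), round(k * length, 1))
```

The third printed column is the conventional test length estimated to match the adaptive test's reliability at each length.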

Validity. The advantage of adaptive tests was not so clear when the validity
of the two test types was compared. Validity was operationalized as the
correlation between test scores and the examinee's raw score on the concurrently
administered 50-item Word Knowledge test. From their superior reliability, it
would  be  expected  that  the  adaptive  tests  would  also  be  superior  in  validity  at 
any  constant  test  length.  As  Table  2  indicates,  the  adaptive  tests  had  higher 
validities  at  test  lengths  up  to  10  items;  at  lengths  of  15  and  up,  however,  the 
conventional  tests  had  slightly  higher  validity.  None  of  the  validity  differ¬ 
ences  was  statistically  significant  at  the  .05  level. 

Conclusions 


Based on the data reported above, several conclusions are offered with regard
to the feasibility and psychometric merits of adaptive aptitude testing of
Marine recruits.

1. Testing Marine recruits with CRT terminals is feasible from both practical
and human engineering standpoints. Embedded programmed instructions can
effectively teach the recruits the use of the testing terminals. The number of
proctors or attendants required to supervise and to assist in the testing room
appears to be acceptably small.

2. Striking psychometric efficiency was demonstrated for the adaptive tests of
verbal ability used in this study. It appears that in military personnel
testing applications, well-constructed short adaptive tests can achieve high
levels of measurement reliability with less than half the number of items
required using conventional testing procedures.

3. There is no apparent psychometric advantage to the intuitively appealing
notion of variable-length adaptive tests, at least for the adaptive testing
method used here.

4. Short fixed-length adaptive tests of about 10 items per examinee seem to be
sufficiently reliable for personnel testing purposes. The adaptive tests
achieved a minimally satisfactory reliability level (.80) in just 5 items;
additional test lengths beyond 10 items did not yield psychometric returns
proportional to the added administration time required.


Parallel Forms Reliability and Measurement Accuracy
Comparison  of  Adaptive  and  Conventional  Testing  Strategies 

Marilyn  F.  Johnson  and  David  J.  Weiss 
University  of  Minnesota 


Prior research at the University of Minnesota has compared the parallel forms
reliabilities of adaptive and conventional vocabulary tests as a function of
test length. The results are shown in Figure 1, which displays alternate forms
reliabilities of Owen's Bayesian adaptive test and a conventional test as a
function of number of items administered. The conventional test was peaked in
information at θ = 0.0, and test items were administered in order of
information, from high to low values. The Bayesian adaptive test was scored by
Bayesian methods, whereas the conventional test was scored by both
proportion-correct and Bayesian methods. Both tests consisted of
five-alternative multiple-choice vocabulary items.

As expected, the plots in Figure 1 show an increase in reliability as test
length increased for both testing strategies. However, rather than the expected
asymptote of reliabilities for both strategies as test length increased, the
reliability of the Bayesian adaptive test surpassed that of the conventional
test. The approximate difference in reliabilities at test termination was
r = .05, with a 30-item reliability of .92 for the Bayesian test and .87 for
the conventional test scored by the Bayesian method. The difference in
reliabilities between Bayesian and proportion-correct-scored conventional tests
was .04 at the 30-item test length.

The analysis also included a comparison of concurrent validity obtained by
correlating the ability estimates with number-correct scores on a 120-item
vocabulary criterion test also composed of five-alternative multiple-choice
questions. These results (see Figure 2) indicated that although the Bayesian
adaptive test was more reliable than the conventional tests, the conventional
tests yielded higher validities when correlated with the criterion test.
Figure 2 shows that the validities, similar to the reliabilities, increased as
a function of test length, with the conventional test yielding higher
validities after four items. The validity of the Bayesian test at 30 items was
.797; that of the Bayesian-scored conventional test was .834; and the
proportion-correct-scored conventional test obtained a validity of .841.

Purpose 


Due to the apparently contradictory nature of these findings, the present
research was designed to replicate them. There were, however, some
modifications to the basic design of the comparison study, and an additional
dependent variable, measurement accuracy, was used to compare the testing
strategies. In addition, the present study compared peaked conventional,
Bayesian adaptive, and maximum information adaptive testing strategies. The
conventional test was also peaked in information evaluated at θ = 0.0. Items on
the conventional test were administered in order of item information but, for
purposes of analysis, were arranged in random order. The item pool was composed
of the same items that were used in the original study, but they were
reparameterized after the original study and prior to the present investigation
(Prestwood & Weiss, 1977). Comparisons of the three testing strategies were
made in terms of parallel forms reliability as a function of test length and in
terms of measurement accuracy as a function of θ level. Accuracy of measurement
was operationalized as the posterior variance of the Bayesian-scored testing
strategies and as standard errors of measurement for the maximum
likelihood-scored testing strategies. Comparisons of scoring strategies,
including Bayesian, maximum likelihood, and proportion-correct scoring, were
made on the basis of parallel forms reliability.

Figure 1

Alternate Forms Reliabilities of Ability Level
Estimates from a Bayesian Adaptive Test and
a Conventional Test Scored by Proportion-Correct
and Bayesian Scoring, as a Function of the Number
of Items Administered

Figure 2

Correlations of Ability Level Estimates
from a Bayesian Adaptive Test and a Conventional Test
Scored by Proportion-Correct and Bayesian Scoring
with Criterion Test Score,
as a Function of the Number of Items Administered
(Averaged Across Two Test Forms)

Method 

Subjects 

Undergraduate and graduate students from the University of Minnesota
volunteered to participate in the fall 1978 and winter 1979 quarters. These
students were recruited from Introductory Biology 1-011, Introductory
Psychology 1-001, and a measurement course, Psychology 5-862. Students from the
introductory psychology and biology courses participated in the study in order
to obtain experimental points, which counted toward their final grade.
Volunteers from the measurement course, both graduate and undergraduate
students, participated at the request of the instructor.




There were 373 students in the conventional testing condition, 390 in the
Bayesian testing condition, and 233 in the maximum information testing
condition. Testing spanned two quarters in order to obtain an adequate number
of students; a total of 996 students were tested during this period. Although
students were recruited from varying subject pools, no difference in population
was suggested because the undergraduate students were all from the College of
Liberal Arts. In addition, students were sequentially assigned to one of the
three testing strategies. The introductory biology and psychology students also
participated in other studies during their experimental hour. In the case of
the biology students, the experimental tests for this study were administered
after a biology test. The fall 1978 introductory psychology students
participated solely in this experiment, whereas the winter 1979 introductory
psychology students first took the experimental test for this study and then
took another test. In each case, only data from the alternate forms verbal
ability tests were analyzed.

Procedure 


All students took the tests at an individual cathode-ray terminal (CRT)
connected to a Hewlett-Packard real-time computer system. A test proctor was
present during testing to provide assistance to the examinees. The students
were assured that they could take as much time as necessary to complete the
tests. Prior to administration of items on the first test, however,
instructional screens explaining the operation of the CRTs were displayed.
After students reviewed the test instructions and responded to a number of
identification and demographic questions, the experimental tests were
administered. Students responded to the five-alternative multiple-choice
vocabulary questions by typing a number into the CRT corresponding to the
chosen alternative.

Item  Pools 


Adaptive test. The Bayesian and maximum information tests used the same item
pool from which to select items. The pool was composed of 256 items selected
for the purposes of this study from the total vocabulary pool, which contained
358 items. The 358 items were newly parameterized items, based on combined data
sources from conventional tests administered between fall 1969 and winter 1978.
The items were parameterized with Urry's (1977) ESTEM program using a
3-parameter logistic ICC model. All items were assumed to have a guessing
parameter of c = .20. (Details regarding the parameterization procedure can be
found in Prestwood & Weiss, 1977.) Selection of items from the larger pool was
based on several criteria, which varied by difficulty levels of the items.
Because there were few very difficult or very easy items, fewer items at these
extremes on the difficulty continuum were eliminated. Items with discrimination
parameters of a = 3.00 were routinely rejected because this value was
identified as a statistical artifact of the parameterization program and not as
a true reflection of the item's discrimination value.
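
For readers unfamiliar with the 3-parameter logistic model, the sketch below
shows the item characteristic curve and the corresponding item information
function. It is an illustration only: the scaling constant D = 1.7 is an
assumption (the text does not state it), and the example item parameters are
hypothetical rather than taken from the vocabulary pool.

```python
import math

D = 1.7  # scaling constant; an assumption, not stated in the text


def p_correct(theta, a, b, c=0.20):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c=0.20):
    """Fisher information contributed by one 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


# Hypothetical item with a = 1.0, b = 0.0 and the guessing value c = .20.
print(p_correct(0.0, 1.0, 0.0))
print(item_information(0.0, 1.0, 0.0))
```

With c = 0 the information expression reduces to the familiar two-parameter
form, (D a)^2 P Q, which is one way to sanity-check the formula.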

Based on a stratification of the items into difficulty levels, items were
eliminated if their discriminations were low. This criterion, however, varied
by difficulty level. In Levels 6 and 7, items were omitted if the
discrimination parameter fell below a = .30. In Levels 3, 4, and 5, where there
were more items, the culling criterion was set at a = .35. In these levels,
also, items were omitted if the sample size on which the parameters were
calibrated was less than 100. In many cases the items rejected on the basis of
sample size were also of low discrimination.

Conventional test. The alternate forms of the conventional test were each
composed of 30 vocabulary items arranged in descending order of item
information evaluated at θ = 0.0. The 60 most informative items at θ = 0.0 were
selected from the vocabulary pool composed of 256 items. By this procedure,
items with relatively higher discrimination levels and difficulties of about
b = 0.0 were selected. Each test was thus peaked with respect to item
information. Items were ordered by information at θ = 0.0, and the 60 items
were divided into Test Form A and Test Form B according to an ABBABAAB
selection scheme. This procedure was used to insure that the alternate forms
did not systematically differ in item information. The items were administered
in order of descending item information. However, for purposes of analysis,
pairs of items from the two test forms were randomly formed to simulate
conventional paper-and-pencil testing conditions. The conventional test items
were selected from the adaptive test pool so that it was possible that adaptive
test items could also be used in the conventional test, since an independent
groups design was being used.
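
The division of the 60 most informative items into two forms can be sketched
as follows. This is a minimal illustration under stated assumptions: the item
identifiers and information values are made up, and the function name is
hypothetical. Only the ABBABAAB pattern and the counts (256 items in the pool,
60 selected, 30 per form) come from the text.

```python
from itertools import cycle


def split_alternate_forms(item_ids_by_info, pattern="ABBABAAB"):
    """Assign items, already sorted by descending information, to two forms."""
    forms = {"A": [], "B": []}
    for item_id, form in zip(item_ids_by_info, cycle(pattern)):
        forms[form].append(item_id)
    return forms


# Hypothetical pool: ids 0..255 with made-up information values at theta = 0.0.
info_at_zero = {i: 1.0 / (1.0 + abs(i - 128)) for i in range(256)}

# Keep the 60 most informative items, in descending order of information.
top60 = sorted(info_at_zero, key=info_at_zero.get, reverse=True)[:60]

forms = split_alternate_forms(top60)
print(len(forms["A"]), len(forms["B"]))
```

Because the eight-position pattern contains four A's and four B's, and the
remaining four positions after seven full cycles are A, B, B, A, each form
receives 30 items, and neither form is systematically favored with the more
informative item in each group.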

Adaptive  Testing  and  Scoring  Strategies 

Alternate forms of the adaptive tests were dynamically selected from the item
pool by a special algorithm. Using an ABBABAAB rotational scheme, Form A of the
adaptive test was given an opportunity to select an item from the pool of
unadministered items, based on the item selection algorithm (Bayesian, maximum
information) in use; and the ability estimate for that form of the adaptive
test was updated. For administration of the next item to a testee, Form B then
selected an item from the current pool of unadministered items; and the ability
estimate for that form was updated. This procedure continued, using the
ABBABAAB rotation, until 30 items were administered for each of the alternate
forms (Form A and Form B), and the ability estimates for each form were saved
after each item was administered.

Bayesian adaptive testing strategy. Items were selected and scored during the
adaptive procedure according to Owen's (1975) Bayesian model. The prior
distribution of ability was assumed to be normal, with a mean of 0.0 and a
variance of 1.00. These values served as initial estimates of ability at the
start of testing for each of the two forms for each individual. Testing was
terminated after 30 items had been administered for each of the two forms.
(Details concerning the Bayesian scoring algorithm can be found in McBride &
Weiss, 1976.)
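
To make the idea of Bayesian scoring concrete, the sketch below computes a
posterior mean and posterior variance for ability under a normal prior with
mean 0.0 and variance 1.00. It is not Owen's procedure: it substitutes a plain
numerical grid for Owen's closed-form update, and the scaling constant D = 1.7,
the grid limits, and the example responses are all assumptions made for
illustration.

```python
import math


def p_correct(theta, a, b, c=0.20, D=1.7):
    """3PL probability of a correct response (D = 1.7 is an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def grid_posterior(responses, lo=-4.0, hi=4.0, n=161):
    """Posterior mean and variance of theta under a N(0, 1) prior.

    responses is a list of (a, b, correct) tuples. The posterior is
    evaluated on an evenly spaced grid rather than by Owen's formulas.
    """
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # unnormalized standard normal prior
        for a, b, correct in responses:
            p = p_correct(t, a, b)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, var


# Hypothetical protocol: ten items with a = 1.0, b = 0.0, half answered correctly.
protocol = [(1.0, 0.0, True)] * 5 + [(1.0, 0.0, False)] * 5
print(grid_posterior([]))        # prior only
print(grid_posterior(protocol))  # after ten responses
```

With no responses the function returns approximately the prior mean and
variance, and the posterior variance shrinks as responses accumulate.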

Maximum information adaptive testing strategy. Items were selected according
to a maximum information item selection routine, and ability estimates were
updated by scoring the responses by maximum likelihood methods (Bejar & Weiss,
1979). The initial estimate of ability was 0.0 for each form. Testing was
terminated after 30 items had been administered for each of the two alternate
forms.
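
The interleaved administration of two adaptive forms from a single pool can be
sketched as below for the maximum information strategy. This is a simplified
simulation, not the program used in the study: the item pool is randomly
generated, responses are simulated from a hypothetical true ability, D = 1.7 is
assumed, and maximum likelihood scoring is approximated by a bounded grid
search rather than the routine cited in the text.

```python
import math
import random

D = 1.7   # assumed scaling constant
C = 0.20  # guessing parameter, as in the text


def p_correct(theta, a, b):
    return C + (1.0 - C) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b):
    p = p_correct(theta, a, b)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - C) / (1.0 - C)) ** 2


GRID = [-4.0 + 0.05 * i for i in range(161)]


def ml_estimate(responses):
    """Grid-search maximum likelihood estimate, bounded to about [-4, 4]."""
    def loglik(t):
        total = 0.0
        for a, b, u in responses:
            p = min(max(p_correct(t, a, b), 1e-9), 1.0 - 1e-9)
            total += math.log(p if u else 1.0 - p)
        return total
    return max(GRID, key=loglik)


def run_parallel_forms(pool, true_theta, rng, n_items=30, pattern="ABBABAAB"):
    """Interleave Forms A and B, each picking from the shared unused items."""
    unused = set(range(len(pool)))
    state = {f: {"theta": 0.0, "items": [], "resp": [], "trace": []}
             for f in "AB"}
    turn = 0
    while any(len(s["items"]) < n_items for s in state.values()):
        s = state[pattern[turn % len(pattern)]]
        turn += 1
        if len(s["items"]) >= n_items:
            continue
        # Most informative unadministered item at this form's current estimate.
        best = max(unused, key=lambda i: item_information(s["theta"], *pool[i]))
        unused.remove(best)
        a, b = pool[best]
        u = rng.random() < p_correct(true_theta, a, b)  # simulated response
        s["items"].append(best)
        s["resp"].append((a, b, u))
        s["theta"] = ml_estimate(s["resp"])
        s["trace"].append(s["theta"])  # estimate saved after every item
    return state


rng = random.Random(0)
pool = [(rng.uniform(0.5, 2.0), rng.uniform(-3.0, 3.0)) for _ in range(256)]
result = run_parallel_forms(pool, true_theta=0.5, rng=rng)
print(result["A"]["trace"][-1], result["B"]["trace"][-1])
```

Each form keeps its own ability estimate and never reuses an item given to the
other form, so the two 30-item protocols are distinct tests administered to
the same simulated examinee.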

The adaptive tests were scored after testing by a scoring strategy other than
the one used during testing. The Bayesian test protocols were scored by maximum
likelihood methods, and the maximum information test protocols were scored by
Bayesian methods. Scores were calculated after each of the 30 items in both
parallel tests. Responses to the two alternate forms of the conventional test
were also rescored by Bayesian and maximum likelihood scoring methods at each
test length from 1 to 30 items.

Independent  Variables 


Testing strategy was the major independent variable of interest. The
strategies compared were the conventional, Bayesian, and maximum information
testing strategies. Methods of scoring were also compared. These included
logistic maximum likelihood scoring, Bayesian scoring, and (for the
conventional test) proportion-correct scoring. Test length was a third
independent variable of interest. Thirty test lengths were obtained by scoring
each 30-item test 30 times. That is, a test was scored after the first item,
after the first two items, after the first three items, and so on until 30
scores were obtained. In this way, 30 test lengths, varying from 1 to 30 items,
were generated for each of the alternate forms.

Dependent  Variables 


Parallel forms reliabilities. Testing strategies were compared on the basis of
parallel forms reliability by correlating corresponding ability estimates
obtained from Forms A and B for a given testing strategy. Since the test
protocols were scored in at least two ways, Bayesian and maximum likelihood, a
total of seven testing-scoring conditions were compared on the basis of
parallel forms reliability. Scoring strategy was compared on the basis of
parallel forms reliability by comparing reliabilities of a single testing
strategy scored by more than one method. Three of the parallel forms
reliabilities paired the appropriate scoring method with each of the three
testing strategies. These were proportion-correct scoring of conventional
tests, maximum likelihood scoring of maximum information tests, and Bayesian
scoring of Bayesian-administered tests.

The remaining four parallel forms reliabilities were obtained by scoring the
test protocols by a scoring routine other than the appropriate one. In this
way, reliabilities were obtained for the Bayesian-scored maximum information
test, the maximum-likelihood-scored Bayesian test, the Bayesian-scored
conventional test, and the maximum-likelihood-scored conventional test.
Proportion-correct scores were not obtained for adaptive tests. Reliabilities
were calculated as a function of test length. That is, reliability was
calculated not only from end-of-test ability estimates but also for each of the
30 test lengths. Scoring method correlations were obtained by correlating
estimates obtained from different scorings of the same testing strategy. These
correlations were used to analyze the similarity of ability estimates obtained
from different scoring techniques applied to a single set of data.
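
Computing parallel forms reliability as a function of test length amounts to
correlating the Form A and Form B estimates across examinees at each length.
The sketch below illustrates this with a small hypothetical data set; the
function names and the numbers are illustrative assumptions, not data from the
study.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def reliability_by_length(est_a, est_b):
    """est_a[p][k] is examinee p's Form A estimate after k + 1 items."""
    n_lengths = len(est_a[0])
    return [
        pearson([row[k] for row in est_a], [row[k] for row in est_b])
        for k in range(n_lengths)
    ]


# Hypothetical estimates for four examinees at test lengths 1, 2, and 3.
est_a = [[0.0, 0.2, 0.3], [0.5, 0.4, 0.6], [-0.5, -0.3, -0.4], [1.0, 0.9, 1.1]]
est_b = [[0.5, 0.1, 0.3], [0.0, 0.5, 0.5], [-0.5, -0.2, -0.5], [0.5, 1.0, 1.0]]

for length, r in enumerate(reliability_by_length(est_a, est_b), start=1):
    print(length, round(r, 3))
```

The same function applies to any testing-scoring combination, since it needs
only the per-length estimates for the two forms.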

Errors of measurement. The three testing strategies were compared on the basis
of their errors of measurement. This was assessed by two methods: one method
estimated errors of measurement on the basis of maximum likelihood scoring
methods, and the other, by Bayesian scoring methods. In the first method, test
protocols were scored by maximum likelihood methods, and the standard error of
measurement (SEM) associated with each ability estimate was calculated. These
values are the reciprocal of the square root of test information at a given
θ level. They indicate how accurate the estimate is and how much it is likely
to vary from the true θ value; the larger the standard error, the more likely
the estimate will be inaccurate.
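
The relation between test information and the standard error of measurement
can be written out directly. In the short sketch below, the item information
values are hypothetical, and test information is taken to be their sum at the
ability level in question.

```python
import math


def standard_error(test_information):
    """SEM at a given theta: reciprocal of the square root of information."""
    return 1.0 / math.sqrt(test_information)


# Hypothetical item information values at one theta level.
item_infos = [0.48, 0.35, 0.52, 0.41]
print(standard_error(sum(item_infos)))

# A test information of 6.25 corresponds to a standard error of 0.4.
print(standard_error(6.25))
```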

The SEM values were averaged within each of 20 θ intervals ranging from
approximately -3.0 to +2.0, and the mean SEM values were then plotted as a
function of θ. This was done on a single randomly chosen parallel form for each
of the three testing strategies.

The posterior variance of the Bayesian ability estimate was also used to
compare the testing strategies on the basis of measurement accuracy. Posterior
variances were averaged within each of 20 θ intervals ranging from -2.0 to
+2.0. These mean values were plotted at the midpoint of the θ intervals and the
points were connected to yield a continuous line. The posterior variance is
analogous in meaning and interpretation to the standard errors of measurement.
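
Averaging an accuracy index within ability intervals and reporting it at the
interval midpoints can be sketched as follows. The sketch assumes equal-width
intervals, which the text does not state explicitly, and the function name and
sample values are hypothetical.

```python
def mean_by_theta_interval(thetas, values, lo=-2.0, hi=2.0, n_bins=20):
    """Average values within equal-width theta intervals.

    Returns (midpoint, mean) pairs for the intervals that contain data.
    """
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, v in zip(thetas, values):
        if lo <= t <= hi:
            # An estimate exactly at the upper limit goes in the last interval.
            k = min(int((t - lo) / width), n_bins - 1)
            sums[k] += v
            counts[k] += 1
    return [
        (lo + (k + 0.5) * width, sums[k] / counts[k])
        for k in range(n_bins)
        if counts[k] > 0
    ]


# Hypothetical ability estimates with their posterior variances.
thetas = [-1.95, -1.9, 0.05, 2.0]
variances = [0.2, 0.4, 0.1, 0.3]
for midpoint, mean_value in mean_by_theta_interval(thetas, variances):
    print(round(midpoint, 2), round(mean_value, 3))
```

The same function serves for the SEM values by changing the interval limits,
for example lo=-3.0 and hi=2.0.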

Although one or the other of these measurement accuracy indices might have
been adequate in comparing the testing strategies, both were included to
minimize any biased conclusions regarding measurement accuracy of the adaptive
tests. In general, posterior variance of Bayesian ability estimates will be
less when items are selected according to a Bayesian testing strategy than when
items are selected by any other adaptive procedure. Use of the posterior
variance alone in the comparison of the adaptive testing strategies may bias
conclusions toward the Bayesian testing strategy. For this reason the standard
error of measurement was also used as an index of measurement accuracy. This
index, in general, will favor the maximum information testing strategy because
items were selected and scored according to a maximum likelihood testing
procedure.

Results 


Were  the  Tests  Parallel? 

Several analyses were performed to determine whether the alternate forms were
functioning as parallel forms. These included comparisons of the means and
variances of the ability estimates as a function of test length for the
alternate forms of each testing strategy.

Score means. In general, the score means of the three testing strategies
(conventional, Bayesian, and maximum information) showed an adequate level of
parallel relationship between Forms A and B. Because the proportion-correct
score metric differs from the θ metric, the adaptive and conventional mean
ability estimates are not directly comparable. Adaptive test comparisons of the
means (Figure 3) show that there were greater differences between mean ability
estimates for the alternate forms of the maximum information testing strategy
than for the Bayesian testing strategy; this was because of the tendency of the
Bayesian item selection and scoring routine to yield conservative estimates of
ability. As testing progressed, however, differences between the ability
estimates for the two alternate forms of each test decreased for both adaptive
tests. Figure 3 also shows that the Bayesian mean ability estimates fell
between the Form A and Form B means from the maximum information testing
strategy. Thus, both adaptive procedures yielded about the same average ability
estimates for the students selected from a common population.

Figure  3 

Mean  Ability  Estimates  from  Parallel  Forms  A  and  B 
of  Maximum  Information  and  Bayesian  Adaptive  Tests, 
as  a  Function  of  Number  of  Items  Administered 




Means of the conventional parallel forms were obtained by averaging
proportion-correct scores at each of 30 test lengths, based on randomly ordered
items. Figure 4 shows that mean proportion-correct scores stabilized to a final
value of .43.


Score variances. Variances of the ability estimates from the maximum
information testing strategy (Figure 5) were relatively high up to 3 items, and
then decreased steadily. The greatest difference in variance between the two
alternate forms was at 3 items (1.25); whereas at 30 items the difference was
only half (.75). Figure 5 also shows that ability score variances decreased
from the beginning to the end of the test. Thus, score variances from the
maximum information tests showed both a decrease in difference between
alternate forms and a decrease in amount of variance as testing proceeded.


In comparison to the ability scores from the maximum information test,
variance in Bayesian ability scores showed a similar maximum difference in
variance for tests of about 5 items in length, followed by decreased
differences, as shown in Figure 5. Level of variance increased, however, as
testing proceeded, reflecting the reduced dependence of the Bayesian ability
estimates on the prior ability estimate. The restriction in Bayesian ability
estimates due to the regression effect was still evident even at 30-item test
lengths, since the ability estimate variances for the Bayesian tests were
substantially lower than those of the maximum information tests.

Figure 4

Mean Proportion-Correct Score of the Conventional Test
for Alternate Forms A and B,
as a Function of Number of Items Administered

Proportion-correct score variance of both parallel forms of the conventional
test decreased rapidly, from a possible maximum of .25 at 1 item to .06 at 30
items, as shown in Figure 6. Based on both the score means and score variances,
the alternate forms of the conventional test were closer to being parallel than
the alternate forms of either of the adaptive tests.

Errors of measurement as a function of test length. Samejima (1977) defines
weakly parallel tests as tests that yield the same information functions. Thus,
evidence for the parallel relationship between the adaptive forms included
examination of their errors of measurement as a function of number of items
administered. Average standard error of measurement, the reciprocal of the
square root of theoretical test information, was used to compare alternate
forms of the maximum information testing strategy. The error of measurement
curves for the maximum information tests (Figure 7) showed the same form with
variance decreasing rapidly to a final value of .40.


Figure  5 

Average  Variances  of  Ability  Estimates  for  Forms 
A  and  B  of  Maximum  Information  Adaptive  Tests  and 
Bayesian  Adaptive  Tests,  as  a  Function  of 
Number  of  Items  Administered 


The  error  of  measurement  index  for  the  Bayesian  testing  strategy  was  the 
posterior  variance  of  the  ability  estimates.  These  data  are  also  shown  as  a 
function  of  test  length  in  Figure  7.  Means  of  the  Bayesian  posterior  variances 
for  the  two  alternate  forms  were  almost  identical,  decreasing  from  an  initial 
value  of  .68,  after  1  item  was  administered,  to  a  final  variance  of  .10,  after 
30  items  were  administered.  As  Figure  7  shows,  there  was  less  variance  in 
Bayesian  ability  estimates  than  in  the  maximum  likelihood  ability  estimates;  but 
the  data  show  that  both  the  Bayesian  and  maximum  information  adaptive  tests 
yielded  parallel  forms  in  terms  of  their  mean  errors  of  measurement,  at  almost 
all  test  lengths. 

Parallel  Forms  Reliability 


Optimal scoring method. The optimal scoring method was maximum likelihood for
the maximum information testing strategy, Bayesian for the Bayesian testing
strategy, and proportion correct for the conventional test. Alternate forms
reliability correlations were computed at each test length for each testing
strategy using these optimal scores.

Figure 6

Variances of Proportion-Correct Scores from
Alternate Forms A and B of the Conventional
Test, as a Function of Number of Items Administered

Reliabilities of the three testing strategies as a function of test length are
shown in Figure 8. The peaked conventional test yielded substantially higher
reliabilities after 11 items than either of the adaptive tests. The greatest
difference between reliabilities was r = .09 between the adaptive and
conventional tests at the 30-item test length; the reliabilities of the
adaptive tests were r = .81, compared with the final reliability of r = .90 for
the conventional test. The data in Figure 8 show essentially the same level and
shape in reliabilities for the adaptive tests, although there was greater
fluctuation in reliabilities for the maximum information test. The conventional
test reliability was nearly identical to that of the Bayesian test up to the
10-item test length, but after that point the conventional test reliability
increased more quickly than that of the adaptive tests. Although adaptive test
reliabilities showed signs of leveling off toward the end of the test, the
reliability of the conventional test seemed to increase steadily.

Figure 7

Means of Standard Error of Measurement from
Parallel Forms A and B of Maximum Information
Adaptive Tests and Mean Posterior Variance of
Parallel Forms A and B of the Bayesian Adaptive Tests,
as a Function of Number of Items Administered

Other scoring strategy. Reliabilities were also obtained from testing
strategies scored by other than optimal scoring strategies. Four
testing-scoring combinations were of interest: Bayesian-scored maximum
information tests, maximum-likelihood-scored Bayesian tests, Bayesian-scored
conventional tests, and maximum-likelihood-scored conventional tests. These
reliability results are shown in Figure 9 as a function of test length.

Figure 8

Parallel  Forms  Reliabilities  of  Optimally  Scored 
Conventional,  Bayesian,  and  Maximum  Information 
Testing  Strategies,  as  a  Function  of 
Number of Items Administered


In general, Figure 9 shows that the Bayesian scoring procedure yielded higher
reliabilities under nonoptimal conditions than the maximum likelihood scoring
procedure. Bayesian scoring of the conventional test yielded essentially the
same reliabilities at every test length as did proportion-correct scoring of
the conventional test. Bayesian scoring of the maximum information tests
yielded higher reliabilities at most test lengths beyond about 12 items than
the optimal scoring strategy for that test. In addition, Bayesian scoring of
the maximum information test tended to decrease substantially the differences
in reliabilities observed between the conventional and adaptive tests. Figure 9
shows that the reliability for the Bayesian-scored maximum information test was
higher than that of the conventional test for test lengths from 3 to 12 items.
The maximum difference between these two reliabilities was r = .05 at 30 items,
as compared to r = .09 for the data in Figure 8. These data indicate that
Bayesian scoring of an adaptive test may yield more stable estimates of ability
than maximum likelihood scoring.

The data also illustrate the inappropriateness of scoring conventional
tests with maximum likelihood scoring methods. As Figure 9 shows, maximum
likelihood scoring of the conventional test resulted in extremely low
reliabilities at all test lengths, reaching a maximum of only .74 at 30 items.


Figure 9

Parallel Forms Reliabilities of Non-Optimally Scored
Testing-Scoring Strategies, as a Function of Number
of Items Administered

Scoring  Method  Correlations 

To  study  the  generality  of  the  findings  of  Kingsbury  and  Weiss  (1979),  in 
their  study  of  correlations  among  latent-trait  scoring  methods  in  achievement 
test  data,  comparisons  of  the  ability  estimates  from  the  various  scoring  methods 
were  made  by  correlating  scores  obtained  from  different  ways  of  scoring  the  same 
testing  strategy.  For  both  adaptive  testing  strategies,  Bayesian  scores  were 
correlated  with  maximum  likelihood  scores.  Conventional  test  comparisons  were 
made  by  correlating  proportion-correct  scores  with  Bayesian  scores,  proportion- 
correct  scores  with  maximum  likelihood  scores,  and  Bayesian  scores  with  maximum 
likelihood  scores.  For  each  testing  strategy,  one  of  the  two  alternate  forms 
was  randomly  chosen  for  these  analyses.  These  five  scoring  combinations  are 
shown  in  Figure  10  as  a  function  of  test  length. 

As Figure 10 shows, the highest correlations were between Bayesian and
proportion-correct scores of the conventional test. These correlations varied in
value from 1.00 for a 1-item test to .85 for a 15-item test, with most
correlations between .97 and .99. The second highest level of correlation was
between the Bayesian- and maximum-likelihood-scored maximum information test,
with most correlations between .93 and .95. With the exception of the latter
half of the correlations between Bayesian and maximum likelihood scores from the
Bayesian test, there were few differences among the other three sets of
correlations; the modal correlation for these three plots was .88. The
correlations between Bayesian and maximum likelihood scores from the Bayesian
test increased steadily after the 15-item test length to a final value of
r = .94.


Figure 10

Correlations Between Scoring Methods
for the Same Alternate Form, as a
Function of Number of Items Administered

Measurement  Precision  as  a  Function  of  Ability  Level 


Figure 11 shows plots of the average standard errors of measurement as a
function of the maximum-likelihood-derived ability distribution. These data are
the reciprocal of the square root of the test information function for each
test. The distribution obtained from this sample varied from about -3.00 to
+2.00 and was divided into equal frequency intervals (N = 20), separately for
each testing strategy.
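
The relation between information and the standard error of measurement used for Figure 11 can be sketched as follows. This is a minimal illustration only; the function name and the information values are hypothetical and are not data from the study.

```python
import math

def sem_from_information(info):
    """Standard error of measurement as the reciprocal of the
    square root of the test information at one ability level."""
    if info <= 0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(info)

# Hypothetical (ability level, test information) pairs, for illustration only.
example_points = [(-2.0, 4.0), (0.0, 16.0), (2.0, 6.25)]
for theta, info in example_points:
    print(theta, sem_from_information(info))
```

Higher information at an ability level gives a smaller standard error there, which is why the plots in Figure 11 can be read as inverted information curves.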

The data indicate that at no point on the ability continuum were the
standard errors of measurement smaller in the conventional test than in the
adaptive tests. In general, the maximum information testing strategy yielded
the smallest standard errors, or greatest measurement precision. The Bayesian
test, when scored by maximum likelihood, had poorer measurement precision at the
lower extreme of the ability continuum than did the maximum information test.
Precision of measurement for all the testing strategies was greater at the
central portion of the ability distribution than at the extremes.


Figure 11

Average Standard Error of Measurement as a Function
of Ability Level for Conventional, Bayesian,
and Maximum Information Testing Strategies
(Non-Converging Values Eliminated)

Bayesian  posterior  variance  comparisons  are  shown  in  Figure  12  as  a  func¬ 
tion  of  the  Bayesian-derived  ability  distribution.  The  distribution  varied  from 
about  -2.00  to  +2.00.  The  average  posterior  variance  was  greater  at  all  points 
along  the  ability  continuum  for  the  conventional  strategy  than  for  either  of  the 
adaptive  tests.  The  Bayesian  and  maximum  information  testing  strategies  had 
about  the  same  level  of  measurement  accuracy  in  the  center  of  the  ability  dis¬ 
tribution.  At  the  extremes  of  the  ability  continuum,  the  Bayesian  testing 
strategy  resulted  in  slightly  better  measurement  precision  than  did  the  maximum 
information  testing  strategy. 

In  both  error  of  measurement  comparisons,  there  was  poorer  measurement  at 
the  low  end  of  the  ability  distribution,  although  the  extremes— both  positive 
and  negative— were  less  precisely  measured  than  the  center  of  the  ability  con¬ 
tinuum.  The  results  indicate  that  the  adaptive  tests  yield  about  the  same  level 
of  measurement  precision  and  that  these  levels  were  greater  than  those  obtained 
from  the  conventional  test  at  all  levels  of  ability. 


Figure  12 

Average  Bayesian  Posterior  Variance  of 
Ability  Estimates  as  a  Function  of 
Ability  Level  for  Conventional,  Bayesian, 
and  Maximum  Information  Testing  Strategies 



Discussion 


The  major  finding  in  this  study  was  that  the  conventional  test  yielded 
higher  alternate  forms  reliability  than  did  the  adaptive  tests.  However,  when 
the  maximum  information  adaptive  test  was  scored  by  the  Bayesian  scoring  algo¬ 
rithm,  reliabilities  of  short  adaptive  tests  were  higher  than  those  of  the  con¬ 
ventional  test,  and  differences  in  reliabilities  were  smaller  at  longer  test 
lengths.  Limitations  of  the  item  pool  might  account  in  part  for  the  lowered 
reliability  of  the  adaptive  tests  in  comparison  to  the  conventional  test,  since 
adaptive  tests  depend  heavily  on  the  quality  of  the  items  in  the  item  pool. 

When an item pool consists of highly discriminating items, every ability level
along the latent trait continuum can be measured with a high degree of precision
using adaptive tests (McBride & Weiss, 1976). When there are few items to
measure abilities at the extremes and/or the available items are of low
discrimination, abilities at the extremes cannot be measured accurately.

The  item  pool  used  for  the  two  adaptive  tests  had  fewer  items  at  the  ex¬ 
tremes  of  the  ability  range  and  these  items  had  relatively  lower  discrimination 
parameters.  It  is  likely  that,  especially  at  abilities  where  there  were  fewer 
items,  the  correlations  between  ability  estimates  would  be  attenuated  and  the 
adaptive  process  would  be  at  a  disadvantage  as  testing  progressed.  The  result 
would  be  that  toward  the  end  of  testing  there  would  be  fewer  and  fewer  items 
available  at  a  given  ability  level. 


The adaptive test scoring process also depends on accurate parameterization
of items and on testees responding according to a single latent trait.
Experimental subjects taking a test that does not relate to any course they are
taking and that does not count for a grade may respond carelessly, with less
than full attention. It is unknown to what extent the item parameters are
inaccurate. An optimal research strategy for comparison of conventional,
Bayesian, and maximum information testing strategies on the basis of parallel
forms reliability is through simulated testing. The disadvantages of inaccurate
item parameters, non-optimal item pool characteristics, and the possibility
that students did not respond exclusively in accordance with their ability level
can be alleviated in simulation.

One  additional  factor  that  limits  the  comparison  of  the  testing  strategies 
in  terms  of  alternate  forms  reliability  correlations  is  the  distribution  of 
ability  in  the  population.  Since  values  of  the  Pearson  product-moment  correla¬ 
tions  depend  on  the  distributions  of  the  ability  estimates  involved,  different 
ability  distributions  can  result  in  different  levels  of  correlation.  Thus,  the 
reliability  correlations  confound  the  distribution  of  the  ability  estimates  with 
the  measurement  precision  of  the  testing  strategies.  Information  is  a  measure 
of  precision  of  measurement,  yielding  comparisons  of  testing  strategies  that  are 
unconfounded by the distribution of the ability estimates. As Figure 11 shows,
both adaptive testing strategies yielded scores with greater
precision/information (lower errors of measurement) than did the conventional
testing strategy.

On  the  basis  of  the  reliability  data,  few  conclusions  can  be  drawn  about 
the  relative  merits  of  the  adaptive  testing  procedures.  Bayesian  scoring  of  the 
Bayesian  test  showed  higher  reliability  than  the  maximum-likelihood-scored  maxi¬ 
mum  information  test.  Bayesian  scoring  of  the  conventional  and  maximum  informa¬ 
tion  testing  strategies  yielded  higher  reliabilities  than  maximum  likelihood 
scoring  of  the  conventional  and  Bayesian  testing  strategies.  This  might  indi¬ 
cate  either  that  the  Bayesian  scoring  algorithm  yields  more  reliable  estimates 
of  ability  or  that  it  yields  the  same  regressed  or  biased  estimate  of  ability. 
The  Bayesian  test  would  tend  to  yield  higher  parallel  forms  reliabilities  than 
the  maximum  information  testing  strategy  in  the  case  where  most  items  measuring 
abilities at the extremes of the distribution are of lower discrimination.
Because the Bayesian adaptive test yields regressed estimates of ability and
requires fewer items measuring abilities at extreme θ values, the Bayesian
ability estimates obtained, although biased, would be more stable than ability
estimates from the maximum information testing strategy.


A Comparison of the Accuracy of Bayesian Adaptive and
Static  Tests  Using  a  Correction  for  Regression 


Steven  Gorman 
Department  of  the  Navy 


The vast changes in computer technology have made a strong impact upon the
field of ability measurement. The increased capabilities and decreased costs of
computer use have opened the door to application of latent trait theory. Two
Bayesian procedures for ability estimation have become popular: the Bayes modal
procedure (Samejima, 1969) and the Owen (1975) algorithm. Both Bayesian
procedures use a prespecified distribution, usually the normal (Gaussian)
distribution, as the prior distribution of ability. The item characteristic
curve (ICC; also called the item response function) is employed as the
likelihood function. The product of the prior distribution and the likelihood
function is proportional to the posterior distribution of ability. These two
procedures can be used in either conventional or adaptive mode.
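
As an illustration of how a prior and a likelihood combine into a posterior, the following minimal sketch computes a Bayes modal estimate under a 3-parameter logistic response function and a normal prior. The grid search, the function names, and the item parameters are assumptions made for this example; the sketch is not the numerical method used by Samejima, Owen, or the author.

```python
import math

def p_correct(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def log_posterior(theta, items, responses, prior_mean=0.0, prior_var=1.0):
    """Unnormalized log posterior: normal log prior plus log likelihood."""
    value = -0.5 * (theta - prior_mean) ** 2 / prior_var
    for (a, b, c), u in zip(items, responses):
        p = p_correct(theta, a, b, c)
        value += math.log(p) if u == 1 else math.log(1.0 - p)
    return value

def bayes_modal(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Posterior mode located by a simple grid search (illustrative choice)."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: log_posterior(t, items, responses))

# Hypothetical items (a, b, c) and one response pattern.
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.9, 0.5, 0.2)]
print(bayes_modal(items, [1, 1, 0]))
```

With no items administered the estimate is the prior mean, and each additional response pulls the mode away from it, which is the source of the regression toward the mean discussed in the next paragraph.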

McBride and Weiss (1976) have studied Owen's Bayesian adaptive procedure
and have determined that with this procedure, ability estimates regress toward
the mean. That is, high-ability examinees tend to achieve lower ability
estimates, and low-ability examinees tend to have higher ability estimates. Urry
(1977) has suggested a correction, namely, dividing the Bayesian regressed
ability estimate by the test reliability. A second, potentially more serious,
problem is the reliance upon accurate 3-parameter logistic item parameters. Urry
(1976) developed OGIVIA3, a computer program to estimate these item parameters.
The effectiveness of this estimation procedure for use in the Owen algorithm was
reviewed by Gugel, Schmidt, and Urry (1976). OGIVIA3 has been revised (Croll &
Urry, in prep.) and has been renamed ANCILLES.

The  purpose  of  the  present  paper  is  to  evaluate  the  effectiveness  of  two 
Bayesian  ability  estimation  procedures  with  a  correction  for  regression  using 
known  and  estimated  parameters.  Specifically,  the  studies  simulated  the  Owen 
algorithm  and  Bayes  modal  testing  methods  in  both  adaptive  and  static  mode  with 
a  correction  for  regression  using  known  parameters  and  the  parameters  estimated 
using  ANCILLES. 


Study  1: 

An  Analysis  of  the  Verbal  Scholastic 
Aptitude  Test 


Background  and  Purpose 


Lord (1968) applied the 3-parameter logistic model developed by Birnbaum
(1968) to the Verbal Scholastic Aptitude Test (VSAT). Until Lord's article,
little research had been conducted using Birnbaum's model. However, since this
article, with the exception of a few articles involving the maximum likelihood
procedure (Bejar, Weiss, & Gialluca, 1977; Kolakowski & Bock, 1970, 1972; Wood,
Wingersky, & Lord, 1976), the overwhelming majority of latent trait research has
applied the work of Birnbaum to adaptive tests and not to conventional tests.
Samejima (1968) detailed the mechanics of a Bayes ability estimator based on a
response pattern of test items. She proved that with an assumed normal
distribution of ability as a prior distribution, and using the ICC as a
likelihood function, the mode of the posterior distribution will provide an
absolute maximum, which can be used as an ability estimate. Urry (1976)
incorporated the Bayes modal procedure in the second stage of his item parameter
estimation program. Owen (1975) developed a Bayesian procedure for estimating
ability; however, this procedure was developed for the adaptive mode. Bejar and
Weiss (1979) programmed the Owen algorithm for scoring static tests, but no data
on its effectiveness were made available.

The  purpose  of  this  study  was  to  investigate  the  efficiency  of  the  Bayes 
modal  and  Owen's  Bayesian  ability  estimation  procedures  relative  to  a  conven¬ 
tional  rights-only  scoring.  In  particular,  the  issues  investigated  are  (1)  con¬ 
ditional  bias,  (2)  conditional  accuracy,  and  (3)  precision  of  test  scores. 


Artificial  data  were  generated  according  to  the  3-parameter  logistic  model: 

$$P_i(\theta) = c_i + (1 - c_i)\left[1 + \exp\left(-1.7\,a_i(\theta - b_i)\right)\right]^{-1} \qquad [1]$$

using  the  LVGEN  program  developed  by  Urry  (1971).  This  program  provided  vectors 
of  responses,  correct  (1)  or  incorrect  (0),  for  the  simulated  examinees  (sims). 
The  test  items  used  had  the  parameters  of  the  first  80  VSAT  items  reported  in 
Lord  (1968). 
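
The data generation step can be sketched as below: each simulated examinee answers an item correctly with the probability given by the 3-parameter logistic model. This is an illustrative stand-in, not the LVGEN program; the function names, the seeds, and the three item parameter triples (a, b, c) are hypothetical and are not the VSAT values from Lord (1968).

```python
import math
import random

def p_correct(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def simulate_responses(thetas, items, seed=0):
    """Return one 0/1 response vector per simulated examinee (sim)."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p_correct(t, a, b, c) else 0
             for (a, b, c) in items]
            for t in thetas]

# Hypothetical item parameters (a, b, c), for illustration only.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]

# 2,000 sims drawn from a normal distribution with mean 0 and variance 1.
ability_rng = random.Random(1)
thetas = [ability_rng.gauss(0.0, 1.0) for _ in range(2000)]
data = simulate_responses(thetas, items)
print(len(data), data[0])
```

Because the generating ability of every sim is known, any scoring method applied to `data` can later be compared against the true values.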

For the purpose of this study, it was assumed that the item parameters
reported in Lord's study were the actual parameters and not estimates, as they
actually were. The 80 items were administered to 2,000 sims from a normal
distribution (mean 0, variance 1) generated by the LLRANDOM computer program
(Learmonth & Lewis, 1973) in conjunction with the LVGEN program. The resulting
vectors of simulated binary responses were analyzed by the ANCILLES program;
estimates of the 80 "known" VSAT parameters were the resultant output. This
allowed a comparison of the robustness of the Bayesian ability estimation
programs to inaccuracy in the item parameter estimates. An additional 2,000
normally distributed sims were administered the VSAT items. This permitted
computation of the correlation of known ability with the various ability
estimates and of the mean and variance of raw scores so that a Z-transformation
could be computed. This allowed comparison of a simpler scoring procedure based
on classical test theory with the two scoring procedures based on latent trait
theory.
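
A raw score to Z-score transformation of the kind described can be sketched as follows. The use of the population standard deviation and the small reference sample are assumptions for illustration; the text does not state which variance estimator was used.

```python
import statistics

def fit_z_transform(reference_raw_scores):
    """Mean and standard deviation of raw scores in a reference group.
    The population form (pstdev) is an assumption of this sketch."""
    mean = statistics.mean(reference_raw_scores)
    sd = statistics.pstdev(reference_raw_scores)
    return mean, sd

def to_z(raw_score, mean, sd):
    """Convert a number-right raw score to a Z-score ability estimate."""
    return (raw_score - mean) / sd

# Hypothetical raw scores standing in for the 2,000-sim reference group.
reference = [40, 50, 60]
mean, sd = fit_z_transform(reference)
print(to_z(50, mean, sd), to_z(60, mean, sd))
```

The resulting Z-scores are on the same standardized scale as the generating ability distribution, which is what allows them to be compared directly with the Bayesian estimates.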

Five conditions of scoring the same item responses were examined: (1) Bayes
modal ability estimates based on known item parameters, (2) Bayes modal ability
estimates based on estimated item parameters, (3) Owen's Bayesian ability
estimates based on known item parameters, (4) Owen's Bayesian ability estimates
based on estimated item parameters, and (5) ability estimates based on raw score
to Z-score transformations.


Properly addressing the evaluation mentioned above required examination of
the test score characteristics as a function of ability level. Therefore, the
ability distribution consisted of 100 sims at each of 11 equally spaced values
in the interval -2.5 ≤ θ ≤ +2.5.


For each of the five simulated test administrations, conditional bias,
conditional accuracy, and conditional precision were estimated from the 100
observations at each ability level (θ_e).

Conditional bias. This statistic provided an indicator of the magnitude
and direction of the error between true ability and ability estimated by each of
the scoring procedures at various levels of the trait continuum:

$$\text{bias} \mid \theta_e = \bar{\hat{\theta}}_e - \theta_e, \qquad [2]$$

where

$\text{bias} \mid \theta_e$ = average bias for each of 11 values of $\theta$ on the trait
continuum,

$\theta_e$ = true ability of examinees for each value, and

$\bar{\hat{\theta}}_e$ = average ability estimates for each value.


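Conditional accuracy. The accuracy of the test scores was provided by the
root mean square error computed for the 11 values using the formula

$$\text{RMSE} \mid \theta_e = \left[\frac{1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_i - \theta_e\right)^2\right]^{1/2}, \qquad [3]$$

where

$\text{RMSE} \mid \theta_e$ = root mean square error conditional upon ability level,

$n$ = 100,

$\theta_e$ = known ability level, and

$\hat{\theta}_i$ = the ability estimate.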
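
Equations 2 and 3 can be computed from the ability estimates obtained at a single true ability level, as in the following minimal sketch. It assumes bias is defined as the mean estimate minus the true ability (so that regression toward the mean shows up as positive bias at low ability levels); the function names and the four estimates are hypothetical.

```python
import math

def conditional_bias(estimates, theta_e):
    """Mean ability estimate minus the true ability level (Equation 2)."""
    return sum(estimates) / len(estimates) - theta_e

def conditional_rmse(estimates, theta_e):
    """Root mean square error about the true ability level (Equation 3)."""
    n = len(estimates)
    return math.sqrt(sum((est - theta_e) ** 2 for est in estimates) / n)

# Hypothetical estimates for sims whose true ability is -2.0
# (the study used n = 100 sims per level).
estimates = [-1.8, -2.1, -1.9, -2.4]
theta_e = -2.0
print(conditional_bias(estimates, theta_e))
print(conditional_rmse(estimates, theta_e))
```

Bias can be near zero while RMSE is large, since positive and negative errors cancel in the first statistic but not in the second; this is why the two are reported separately.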


Conditional precision. This statistic was provided by the test score
information function. The information generated by a score about a given ability
level can be compared to the precision of measurement at that point. Samejima
(1977) stated that the inverse of the square root of information can be
considered as the standard error of measurement when number of items and test
information are sufficiently large. Birnbaum (1968) provides a formula for
information:

$$I_x(\theta') = \frac{\left[\dfrac{\partial}{\partial\theta}\,\mu_{x\mid\theta}\right]^2_{\theta=\theta'}}{\sigma^2_{x\mid\theta'}} \qquad [4]$$

where $I_x(\theta')$ is the information about $\theta$ provided by score x, and
$\mu_{x\mid\theta}$ and $\sigma^2_{x\mid\theta}$ are the conditional mean and variance of the
score. Sim scores were calculated at each of 11 equally spaced ability levels,
-2.5 ≤ θ ≤ +2.5; these test score means were used to estimate the slope by
fitting a curve through three consecutive values. Because test score means were
required on either side of the information point, information values could not
be computed for the ends of the continuum (-2.5, +2.5).
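
The score information computation can be sketched as below. The sketch assumes that the curve fitted through three consecutive values is a parabola; for equally spaced ability levels, the slope of that parabola at the middle point equals the central difference used here. The function name and the numerical values are hypothetical.

```python
def score_information(thetas, means, variances):
    """Score information at each interior ability level.

    thetas    -- equally spaced ability levels, in increasing order
    means     -- mean test score at each level
    variances -- conditional variance of the test score at each level
    Returns a dict {theta: information}; the two end points are skipped
    because a neighbor is needed on each side to estimate the slope.
    """
    info = {}
    for k in range(1, len(thetas) - 1):
        width = thetas[k + 1] - thetas[k - 1]
        slope = (means[k + 1] - means[k - 1]) / width
        info[thetas[k]] = slope ** 2 / variances[k]
    return info

# Hypothetical conditional means and variances at three levels.
thetas = [-0.5, 0.0, 0.5]
means = [-0.45, 0.0, 0.45]
variances = [0.05, 0.04, 0.05]
print(score_information(thetas, means, variances))
```

With 11 ability levels this procedure yields 9 information values, consistent with the nine levels from -2.0 to +2.0 reported in the results.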

Results 


Estimation  bias.  The  comparisons  between  the  two  Bayesian  procedures  for 
scoring  static  tests  using  estimated  parameters  and  the  raw  score  to  Z-score 
transformation  are  in  Figure  1.  The  figure  shows  that  the  absolute  value  of 
bias  for  the  Z-score  was  much  greater  than  for  the  two  Bayesian  procedures  at 
ability  level  -2.5.  The  absolute  value  of  Bayesian  score  bias  tended  to  be 
equal  to  or  lower  than  that  of  the  Z-score  along  the  entire  trait  continuum.  Of 
the  two  Bayesian  procedures,  the  Bayes  modal  bias  was  greater  at  upper  trait 
levels. 


Figure  1 

Bias  of  Three  Scoring  Procedures,  Using 
Estimated  Item  Parameters  in  a  Static  Test 


Table 1 shows the bias values of the two Bayesian procedures under
conditions of known and estimated parameters, as well as the conventional
Z-score method. The Bayes modal scores using known parameters still suffered to
some degree from the regression to the mean effect, although deviations from
zero were mostly lower than the bias from either estimated Bayes or Z-score
methods. Improvements to the estimation of item parameters could decrease the
bias of the two Bayesian static procedures significantly.



Conditional accuracy. Figure 2 displays the root mean square error (RMSE)
of ability estimation for the two Bayesian algorithms using estimated parameters
and the Z-score method. All three methods followed the same trend of having
high RMSE values at the low-ability levels and diminishing asymptotically to a
value of about .2 at the trait level +.5. This phenomenon appeared to be a
function of the test itself, with its emphasis on more precise measurement at
the higher ability levels. The conventional scoring procedure tended to have
the highest inaccuracy, with two exceptions (ability levels -1.0 and +2.5).
Table 2 lists the RMSE values for the two Bayesian methods using known and
estimated parameters.


Table 2

Root Mean Square Error of the Z-Score Method
and Two Bayesian Scoring Methods Using
Estimated and Known Parameters in a Static Test

                       Estimated Parameters     Known Parameters
Ability                Bayes                    Bayes
Level      Z-Score     Modal      Owen's        Modal      Owen's

 -2.5       1.048      .875       .686          .703       .537
 -2.0        .652      .567       .412          .486       .365
 -1.5        .386      .418       .405          .482       .463
 -1.0        .263      .325       .361          .370       .425
 -0.5        .296      .284       .338          .300       .382
  0.0        .328      .243       .273          .241       .288
  0.5        .281      .203       .221          .191       .213
  1.0        .264      .195       .215          .185       .212
  1.5        .313      .176       .233          .160       .204
  2.0        .314      .205       .293          .188       .239
  2.5        .191      .266       .280          .252       .231

Conditional  precision.  The  test  score  information  values  at  the  nine  abil¬ 
ity  levels,  -2.0  to  +2.0,  for  the  two  Bayesian  scoring  methods  using  estimated 
parameters  and  the  conventional  scoring  procedure,  are  in  Figure  3;  numerical 
values  are  in  Table  3.  The  data  in  Table  3  coincide  with  two  trends  of  the  ear¬ 
lier  study  (Lord,  1968,  p.  998)  on  the  VSAT.  First,  the  data  in  Table  3  (as 
well  as  in  Table  2)  illustrate  the  more  precise  measurement  on  the  VSAT  at  upper 
ability  levels.  Second,  the  data  show  that  significant  increases  in  precision 
can  be  gained  by  using  the  Bayesian  scoring  procedures. 

The  original  study  weighted  items  based  on  the  logistic  model  and  found 
this  procedure  provided  greater  information  than  conventional  scoring.  The  av¬ 
erage  score  information  value  for  conventional  scoring  was  12.195;  the  average 
for  the  Owen  scoring  was  13.800  and  was  14.120  for  the  Bayes  modal  scoring,  with 
estimated  item  parameters  used  in  the  scoring  procedures.  Slightly  higher  aver¬ 
ages  (13.960  for  the  Owen  and  14.503  for  the  Bayes  modal  scoring)  occurred  when 
the  known  item  parameters  were  available. 

Fidelity. Fidelity coefficients, the correlations of the known ability of
2,000 sims from a normal population with their estimated abilities, were
computed from the various test scoring methods and are in Table 4. Although the
increase in the correlation is only roughly .02 for the two Bayesian methods
over the conventional method, it is equivalent (by the Spearman-Brown prophecy
formula) to a fidelity coefficient of a 120-item test scored conventionally.
Also of interest is the fact that the fidelity coefficients computed using
either Bayesian procedure with known item parameters were not significantly
different from the fidelity coefficients computed from Bayesian scoring with
estimated item parameters. This attests to the robustness of the Bayesian
scoring procedure to errors in item parameter estimation.


Table 4

Correlation of Known Ability with Ability
Estimates for a Conventional Z-Score
Scoring Method and Two Bayesian Scoring
Methods Using Known and Estimated
Parameters in a Static Test

Scoring Method                              r

Conventional Z-Score Transformation       .941
Bayes Modal
   Estimated Parameters                   .959*
   Known Parameters                       .960*
Owen's Bayesian
   Estimated Parameters                   .958*
   Known Parameters                       .958*

*Values significantly different from conventional Z-score transformation r
at p < .0001.
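
The comparison to a 120-item conventionally scored test can be checked numerically. The sketch below assumes that the formula referred to is the Spearman-Brown prophecy formula and that reliability may be taken as the square of the fidelity coefficient (the classical relation between reliability and the correlation of observed with true scores); both are assumptions of this example rather than steps documented in the paper.

```python
import math

def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability
            / (1.0 + (length_factor - 1.0) * reliability))

fidelity_80 = 0.941                      # conventional Z-score value, Table 4
reliability_80 = fidelity_80 ** 2        # assumption: reliability = fidelity squared
reliability_120 = spearman_brown(reliability_80, 120 / 80)
fidelity_120 = math.sqrt(reliability_120)
print(fidelity_120)
```

Under these assumptions the projected fidelity for a 120-item conventional test comes out at about .959, in line with the Bayesian coefficients reported for the 80-item test.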

Conclusions.  It  is  apparent  that  improvements  in  the  measurement  of  exam¬ 
inees  on  conventional  tests  can  be  realized  by  the  use  of  mathematical  scoring 
procedures  that  are  based  upon  latent  trait  theory.  Bias  seems  to  be  dimin¬ 
ished,  and  test  score  accuracy  and  precision  are  improved  with  these  two  Bayes¬ 
ian  scoring  procedures,  compared  to  the  conventional  scoring  method. 

Study  2: 

An  Analysis  of  the  Effect  of  the 
Correction  for  Regression  and  Parameter  Estimation 
Errors  Upon  Two  Bayesian  Adaptive  Testing  Procedures 


Purpose 


The  present  study  simulated  an  adaptive  test  using  both  Owen's  Bayesian 
procedure  and  the  Bayes  modal  procedure.  The  research  attempted  to  determine 
the  effect  of  item  parameter  estimation  errors  upon  the  test  characteristics  as 
a  function  of  ability  level.  In  addition,  this  study  investigated  the  effect  of 
a  correction  for  regression  applied  to  the  ability  estimates  obtained  using  the 
Owen  algorithm.  The  Bayes  modal  procedure  already  incorporates  this  regression 
correction. 

Owen's  Bayesian  Procedure  and  the  Correction  for  Regression 

The Bayesian adaptive ability estimation procedure has been well documented
elsewhere (McBride & Weiss, 1976; Owen, 1975) and will not be reported here.
However, to understand the correction, a brief conceptual description is in
order. The procedure begins by assuming a normal prior distribution of ability
with mean 0 and variance 1. The item bank is then scanned to identify the item
that will minimize the expectation of the posterior variance of the distribution
if administered. That item is then administered, and a new ability estimate
(mean of posterior distribution) and variance about that estimate are computed.
The ability estimate is then used as the prior mean, and an item is again
selected to minimize the expected value of the variance of the posterior
distribution. This procedure is repeated iteratively.

A  correction  for  regression  is  applied  to  the  final  ability  estimate.  The 
correction  consists  of  dividing  the  final  ability  estimate  by  what  Urry  (1977) 
refers  to  as  the  test  reliability.  This  reliability  is  1.0  minus  the  Bayesian 
posterior  variance,  and  this  value  obviously  will  differ  for  each  individualized 
test.  Urry  believes  that  more  accurate  measurement  is  attained  by  terminating 
adaptive  tests  based  on  a  fixed  posterior  variance,  rather  than  a  fixed  number 
of  items.  However,  Urry  (1977)  concedes  that  this  correction  should  be  effec¬ 
tive  for  both  fixed  and  variable-length  tests.  This  study  investigates  the 
fixed-length  test  only. 
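
The correction described above amounts to a single division, sketched here. The function name and the numerical values are hypothetical; the only relation taken from the text is that the reliability is 1.0 minus the Bayesian posterior variance.

```python
def correct_for_regression(theta_hat, posterior_variance):
    """Divide a regressed Bayesian ability estimate by the test
    reliability, taken as 1.0 minus the posterior variance."""
    reliability = 1.0 - posterior_variance
    if reliability <= 0.0:
        raise ValueError("posterior variance must be below 1.0")
    return theta_hat / reliability

# Hypothetical final estimate and posterior variance for one examinee.
print(correct_for_regression(-1.8, 0.10))
```

Because the reliability is below 1.0, the correction always moves the estimate away from the prior mean of 0, and it does so more strongly when the posterior variance is large.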

Bayes  Modal  Adaptive  Procedure 

The  Bayes  modal  adaptive  ability  estimation  procedure  developed  for  this 
study  consisted  of  two  algorithms— one  to  estimate  ability  and  one  to  select 
appropriate  items  to  be  administered.  The  ability  estimation  algorithm  was 
based  on  the  Bayesian  scoring  procedure  developed  by  Samejima,  using  the  item 
response  function  and  an  assumption  of  a  normal  distribution  of  ability.  Urry 
(1976)  uses  this  procedure  in  the  second  iterative  stage  of  his  item  parameter 
estimation  procedure.  The  item  selection  procedure  chooses  that  item  which  pro¬ 
vides  the  highest  level  of  item  information  for  the  current  ability  estimate. 

The  item  response  function  for  all  administered  items  is  computed.  The  product 
of  all  item  response  functions  and  the  assumed  normal  density  function  is  the 
posterior  distribution;  the  mode  of  this  distribution  is  the  ability  level  esti¬ 
mate.  This  value  is  then  unregressed  using  the  same  correction  as  stated  earli¬ 
er.  However,  unlike  the  Owen  procedure,  the  corrected  estimate  is  then  used  as 
the  starting  point  for  the  next  iteration  of  item  selection. 
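
The item selection step can be sketched as a search for the unadministered item with the greatest information at the current ability estimate. The information expression is the standard one for the 3-parameter logistic model with scaling constant 1.7. The bank below imitates an evenly spaced design with a = 1.6, and the guessing parameter of 0 is an assumption made only to keep the example simple; function names are hypothetical.

```python
import math

D = 1.7  # logistic scaling constant

def p_correct(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """Item information for the 3-parameter logistic model."""
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_item(theta, bank, administered):
    """Index of the unused item with maximum information at theta."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))

# Hypothetical bank: 101 items, b from -2.5 to +2.5 in steps of .05,
# a = 1.6, and c = 0 (the guessing value is an assumption).
bank = [(1.6, -2.5 + 0.05 * k, 0.0) for k in range(101)]
first = select_item(0.0, bank, set())
second = select_item(0.0, bank, {first})
print(first, second)
```

In a full adaptive test this selection would alternate with re-estimation of ability after each response, with the set of administered items growing by one on every cycle.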

Design  of  the  Study 

Two "ideal" banks were generated, each consisting of 101 items at equal
increments of b = .05 over the range -2.5 ≤ b ≤ +2.5. One bank used items whose
item discriminations were set at a = 1.6; the other, at a = .8. The item
parameters were estimated by the ANCILLES program on a group of 50 items based
on the responses of 2,000 sims. The procedures differed from Study 1 in that the
items were scrambled with the parameters from item banks of another study
(Gorman, in prep.). The analysis was based upon three test characteristics as a
function of ability level (bias, accuracy, and precision), as documented in
Study 1.

Results 

Conditional bias. Figure 4 displays the score bias from the 25-item
adaptive test employing the Owen algorithm, with and without the correction for
regression, and the Bayes modal procedure. The three lines represent the bias in
the adaptive procedures using the item bank with item discriminations of
a = 1.6, based on estimated parameters. The Owen procedure with the correction
provided the least bias.


Figure  4 

Effect  of  Regression  Correction  Upon  Bias  of 
25-Item  Bayes  Modal  and  Owen  Adaptive  Tests 
(a = 1.6) with Estimated Parameters


Table  5  shows  the  effect  of  regression  upon  the  ability  estimates  from  the 
Owen  procedure  using  known  and  estimated  parameters.  An  interesting  result  is 
that  the  regression  phenomenon  was  more  prevalent  when  known  parameters  were 
used in the Owen scoring with a correction than with estimated parameters using
the  same  correction.  This  may  be  due  to  sampling  errors  in  parameter  estimation 
working  in  the  preferred  direction  on  this  criterion. 

Using the less discriminating item bank (a = .8), the regression was more
extreme, but the correction using estimated parameters again adequately
compensated. The regression correction was less effective when using known
parameters.


The  Bayes  modal  adaptive  test  did  not  fare  as  well  as  the  Owen  adaptive 
test. This can be seen in Table 6, which lists the bias for the two item banks
under  conditions  of  known  and  estimated  parameters.  With  known  parameters,  the 
bias  was  tolerable  with  the  better  item  bank.  The  bias  under  the  other  three 
conditions  was  significantly  greater. 

Table 5

Effect of Regression Correction Upon Bias of the 25-Item
Owen's Adaptive Test with Estimated and Known Parameters
for Two Item Banks

Item Bank and      Estimated Parameters       Known Parameters
Ability Level     Corrected  Uncorrected    Corrected  Uncorrected

a = .8 Item Bank
    -2.5           .063        .468        -.055        .416
    -2.0           .021        .351        -.167        .237
    -1.5          -.036        .222        -.127        .179
    -1.0          -.029        .146        -.124        .091
    -0.5          -.053        .042        -.067        .043
     0.0           .008        .007         .008        .006
     0.5           .049       -.047         .102       -.019
     1.0           .011       -.165         .075       -.143
     1.5          -.027       -.283         .154       -.186
     2.0           .046       -.306         .268       -.205
     2.5           .036       -.403         .275       -.314

a = 1.6 Item Bank
    -2.5           .091        .326        -.008        .242
    -2.0           .019        .207        -.077        .125
    -1.5           .055        .193         .001        .151
    -1.0          -.004        .093        -.046        .061
    -0.5          -.040        .013        -.068       -.009
     0.0           .031        .028         .007        .006
     0.5           .115        .055         .071        .010
     1.0           .054       -.046         .064       -.050
     1.5           .071       -.077         .117       -.059
     2.0           .054       -.140         .161       -.077
     2.5           .041       -.221         .153       -.150

Conditional accuracy. Table 7 shows the effect of the regression correction
upon the root mean square error (RMSE) of the 25-item Owen adaptive test.
The average RMSE value for the Owen ability estimates using known item
parameters without the correction was .225; using estimated item parameters with
the correction, .233; using known item parameters with the correction, .241; and
using estimated item parameters without the correction, .225.

The a = .8 item bank followed this same trend, only to a greater degree,
with the exception that the highest average RMSE value was with the Owen
procedure using known item parameters and corrected for regression. This result
runs counter to expectation and may again be due to errors in item parameter
estimation favorable to the Owen procedure. Another trend for both item banks
was that the RMSE values were lowest about the mean and increased in magnitude
as a function of distance from the mean.
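
For readers who want to see the two criteria side by side, the Python sketch
below computes conditional bias and conditional RMSE at one true ability level.
It assumes the usual definitions (mean signed error and root mean squared
error); the exact formulas of Study 1 are not reproduced in this section, and
the function names are hypothetical. The last lines average the eleven tabled
RMSE values of Table 7 for the a = 1.6 bank with known parameters and the
correction.

```python
import math


def conditional_bias(estimates, true_theta):
    """Mean signed error of the ability estimates at one true ability level."""
    return sum(e - true_theta for e in estimates) / len(estimates)


def conditional_rmse(estimates, true_theta):
    """Root mean square error of the ability estimates at one true ability level."""
    return math.sqrt(sum((e - true_theta) ** 2 for e in estimates) / len(estimates))


# Table 7, a = 1.6 item bank, known parameters, corrected (ability -2.5 to +2.5).
known_corrected = [.2440, .2481, .2202, .2088, .2353, .2133,
                   .2386, .2399, .2512, .2670, .2861]
average_rmse = sum(known_corrected) / len(known_corrected)
```

Rounded to three places, this average agrees with the .241 reported above for
known item parameters with the correction.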

Table 8 lists the RMSE for the Bayes modal adaptive test. On the item bank
with a = .8 using estimated parameters, the conditional accuracy was poorer than
with the Owen procedure. On the other hand, with the same item bank using known
parameters, accuracy was greater with the Bayes modal procedure. With the
better item bank, the Owen procedure was superior to the Bayes modal on this
criterion.

Table 6

Bias of the 25-Item Bayes Modal Adaptive Test
Using Estimated and Known Parameters,
with Two Item Banks

                   a = .8 Item Bank          a = 1.6 Item Bank
Ability           Estimated     Known       Estimated     Known
Level            Parameters  Parameters    Parameters  Parameters

    -2.5           .575        .213         .324        .115
    -2.0           .382        .133         .232        .085
    -1.5           .262        .101         .228        .156
    -1.0           .094        .027         .143        .097
    -0.5           .049        .035         .058        .047
     0.0           .066        .043         .143        .059
     0.5           .087        .084         .099        .066
     1.0          -.049       -.013        -.067       -.002
     1.5          -.067       -.101        -.002       -.038
     2.0          -.195       -.092        -.208       -.058
     2.5          -.343       -.080        -.251       -.038

Table 7

Effect of Regression Correction Upon Root Mean Square Error
of the 25-Item Owen Adaptive Test with Estimated and
Known Parameters for Two Item Banks

Item Bank and      Estimated Parameters       Known Parameters
Ability Level     Corrected  Uncorrected    Corrected  Uncorrected

a = .8 Item Bank
    -2.5           .370        .556         .413        .532
    -2.0           .423        .497         .454        .417
    -1.5           .404        .403         .384        .347
    -1.0           .432        .388         .433        .349
    -0.5           .368        .305         .387        .310
     0.0           .352        .291         .390        .313
     0.5           .447        .370         .420        .325
     1.0           .401        .372         .445        .376
     1.5           .351        .404         .416        .357
     2.0           .339        .416         .531        .411
     2.5           .395        .512         .541        .475

a = 1.6 Item Bank
    -2.5          .2907       .3020        .2440       .3181
    -2.0          .2411       .2973        .2481       .2434
    -1.5          .2629       .3020        .2202       .2496
    -1.0          .1864       .1926        .2088       .1927
    -0.5          .2223       .1980        .2353       .2025
     0.0          .2297       .2072        .2133       .1906
     0.5          .2351       .1945        .2386       .2034
     1.0          .2142       .1953        .2399       .2118
     1.5          .2069       .1923        .2512       .2060
     2.0          .2333       .2452        .2670       .2033
     2.5          .2487       .2988        .2861       .2558

Table 8

Root Mean Square Error of the 25-Item Bayes Modal Adaptive Test
Using Estimated and Known Parameters, with Two Item Banks

                   a = .8 Item Bank          a = 1.6 Item Bank
Ability           Estimated     Known       Estimated     Known
Level            Parameters  Parameters    Parameters  Parameters

    -2.5           .783        .445         .589        .393
    -2.0           .593        .431         .393        .327
    -1.5           .428        .373         .388        .279
    -1.0           .366        .386         .290        .245
    -0.5           .411        .384         .261        .264
     0.0           .412        .351         .339        .290
     0.5           .426        .414         .268        .233
     1.0           .321        .327         .234        .228
     1.5           .328        .395         .172        .215
     2.0           .320        .368         .269        .206
     2.5           .447        .340         .364        .210

Table 9

Test Score Information of Two 25-Item Bayesian Tests,
Using Known and Estimated Parameters,
with Two Item Banks

Adaptive Test      a = .8 Item Bank          a = 1.6 Item Bank
and Ability         Known     Estimated       Known     Estimated
Level            Parameters  Parameters    Parameters  Parameters

Owen's Bayesian
    -2.0          2.591       4.738       15.776      17.847
    -1.5          5.519       8.359       14.786      22.501
    -1.0          5.311       6.475       23.878      21.238
    -0.5          7.975       8.726       21.079      21.571
     0.0          9.808       9.009       25.752      28.516
     0.5          5.181       6.974       26.434      21.809
     1.0          5.425       6.021       22.470      21.041
     1.5          8.975       9.519       26.369      24.226
     2.0          9.863       5.897       18.315      23.473

Bayes Modal
    -2.0          1.325       4.210        5.067       9.898
    -1.5          2.850       5.822        5.435      13.496
    -1.0          4.476       5.689        8.919      13.241
    -0.5          4.859       6.425       11.277      11.670
     0.0          6.347       8.906       10.608      12.359
     0.5          4.982       5.695       17.014      18.389
     1.0          7.565       6.504       24.089      15.992
     1.5          6.621       5.594       21.752      19.452
     2.0          5.142       7.703        8.609      23.604


Conditional  precision.  Table  9  lists  values  of  score  information  for 
25-item  tests  with  both  Bayesian  adaptive  methods  and  two  item  banks.  The  item 
parameter  estimation  errors  rearranged  the  test  score  distribution  and,  hence, 
its  information.  The  Owen  procedure  provided  more  information  about  the  mean 
and  dropped  off  somewhat  at  the  extremes.  The  Bayes  modal  procedure  provided 
considerably  less  information;  hence,  the  standard  error  of  measurement  was 
larger  at  all  ability  levels. 
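
The link between information and the standard error of measurement can be made
concrete with a short Python sketch. It assumes the usual relation in which the
standard error is the reciprocal of the square root of the score information;
the function name is hypothetical, and the two input values are the Table 9
entries at ability level 0.0 for the a = 1.6 bank with known parameters.

```python
import math


def sem_from_information(information):
    """Standard error of measurement implied by a test score information value."""
    return 1.0 / math.sqrt(information)


owen_sem = sem_from_information(25.752)   # Owen's Bayesian, Table 9
modal_sem = sem_from_information(10.608)  # Bayes modal, Table 9
```

Under this relation the Owen value is close to .20 and the Bayes modal value
close to .31 at that ability level.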

Conclusions 


The correction for regression effectively diminished the regression to the
mean effect. Fortunately, the errors of parameter estimation provided by
ANCILLES worked in favor of less biased measurement. The accuracy of the Owen
adaptive fixed-length test with this correction was somewhat poorer with
parameters estimated by ANCILLES than with known parameters. This drop in
accuracy did not appear to be severe enough to discount the Owen procedure for
adaptive testing. The Bayes modal adaptive procedure as implemented in this
study needs further work to equal or surpass the Owen algorithm, even with more
accurately estimated parameters.

Discussion: Session 1


Brian  Waters 
Air  University 


The Department of Defense enlists, classifies, and assigns hundreds of
thousands of men and women annually, with test scores a major determinant of
these decisions. The testing function must be performed more efficiently,
accurately, and equitably; and computerized adaptive testing (CAT) provides the
promise of greatly improved large-scale testing efficiencies. Work such as that
reported by this session's authors on various adaptive testing strategies is
therefore important.

These papers represent two lines of needed research on CAT: basic research
and more applied research. We still have many theoretical questions best
addressed by simulation studies such as Gorman's, as well as myriad practical
problems, which are best investigated with live data empirical studies such as
McBride's and Johnson and Weiss's. I enjoyed reading each of these papers,
particularly the mental exercise of analyzing the contradictory results of the
latter two studies.

The finding from the Johnson and Weiss paper and the McBride paper that most
caught my attention was the pair of opposite results obtained in McBride's
Figure 1 and Johnson and Weiss's Figure 8. These two analyses of Bayesian
adaptive testing versus conventional testing both examined parallel forms
reliability as a function of test length. McBride's results were consistent
with the bulk of similar work done in the past, but the Johnson and Weiss
results were startlingly different. The latter paper showed the conventional
test yielding consistently higher reliabilities after about 10 items. In an
effort to explain this difference in two similarly designed studies, Tom Warm of
the Coast Guard Institute, Jim McBride, Marilyn Johnson, Brad Sympson, and I
tried to determine what could have led to the conflicting results. Figure 1
shows a plot of the results from the two studies on comparable data. My
tentative conclusions attribute the differences to the parameterization
process, the test difficulty, or differences in examinee characteristics. My
best "guess" is that the first of these is the major cause of the contradictory
results.

McBride designed an item pool that was extraordinary by any standards. In
effect, he followed Urry's guidelines for selection of item characteristics for
an adaptive test. All a parameters were more than .80 and all c parameters were
less than .30. His average a values for the conventional and adaptive tests
were 1.40 and 1.20, respectively. In addition, McBride's items were
parameterized on a group of 4,000 examinees from a directly comparable
population and produced a nearly rectangular distribution of information.


Figure  1 

Alternate  Forms  Reliability  Coefficients 
from  the  McBride  and  Johnson  and  Weiss  Studies 


Johnson and Weiss had test item a parameters as low as .65, with a mean of
1.05 and a range of .65 to 2.25 on the conventional test. The adaptive test a
parameter range, however, was .04 to 3.00, with a mean of .76. Particularly in
the extreme ranges of ability, some of the items were adding practically no
information to the adaptive test. The items were parameterized on far fewer
examinees (82 to 1,861 with a median of about 300), and the item distribution
was much more peaked. As McBride (paraphrasing Urry, 1970) stated in his paper,
"a good tailored test design is superior [to conventional testing], provided
that highly discriminatory test items are available." From a purely
psychometric viewpoint, I would expect McBride's items to be more effective in
an adaptive test as compared to a conventional test and to have more stable item
parameter estimates than Johnson and Weiss's.
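
To show why items with very low discrimination add practically no information,
the Python sketch below evaluates the item information function at the point
where ability equals item difficulty. It is an illustration under assumptions:
a three-parameter logistic model with scaling constant D = 1.7 is assumed, the
guessing parameter is set to zero because the pools' c values are not given
here, and the function name is hypothetical. The a values are the ones quoted
above.

```python
import math

D = 1.7  # assumed logistic scaling constant


def item_information(theta, a, b, c=0.0):
    """Item information under an assumed three-parameter logistic model."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


# Information at theta = b = 0 for the discriminations mentioned in the text.
info_lowest = item_information(0.0, 0.04, 0.0)   # lowest adaptive-pool a (Johnson and Weiss)
info_mean = item_information(0.0, 0.76, 0.0)     # mean adaptive-pool a (Johnson and Weiss)
info_mcbride = item_information(0.0, 1.20, 0.0)  # average adaptive-test a (McBride)
```

Because information grows with the square of a, the item with a = .04
contributes only about a thousandth of what an item with a = 1.20 contributes
at its best point.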


These contradictory results concern me in another way. Johnson and Weiss's
data come from a much more "real world" situation. McBride's careful item
selection, parameterization, and design are to be highly commended; however, in
many applications, the "ideal" item pool he used is simply not obtainable.
Unfortunately, most of us will be faced with a pool more like that of Johnson
and Weiss. If, in fact, their results become typical, the practical application
of adaptive testing is threatened. The Johnson and Weiss study thus needs
replication.

McBride's study was exceptionally well done. It is nice to see data from
the real world rather than from just "Psychology 101" students. I would have
liked to have seen test statistics, including reliability, reported for the
50-item criterion test used in the validity analyses. McBride's result of a
large increase in reliability with no significant change in validity is not
atypical. More information on the criterion measure would have helped the
reader conjecture why the validities did not increase with the reliability
coefficients. My feeling is that it is related to the fact that the correlation
coefficient only uses mean values and that the criterion measure was a
conventional test score. If, as the errors of measurement suggest, the adaptive
scores had less error variance and more true variance in them, then I would
expect less correlation between adaptive and conventional scores than between
two conventional scores. The additional true variance would be unique to the
adaptive scores, whereas some of the error variance would be common, by chance,
to the conventional test scores.

In a recent conversation with McBride, I discovered that since the
conference he has acquired another criterion score on the examinees from this
study. He reports that the validity coefficients on the adaptive tests were
consistently higher (up to .19) than the conventional test validities, with the
largest gain at shorter test lengths.

Before leaving these two papers, I would like to comment briefly on
McBride's conclusion that fixed test length was as reliable as variable test
length. I have a difficult time conceptually accepting this result, if for no
other reason than that I believe that individual differences must make a
difference. Practically, fixed length is certainly logistically and legally more
realistic, which are perfectly valid reasons for using this testing strategy.
Theoretically, however, I feel that potential efficiencies must exist with
variable length. As Richard Anderson of the University of Illinois has said,
"You can't let bad data ruin a good theory."

Gorman's paper really consisted of two independent Monte Carlo simulation
studies that followed up work suggested by McBride and Weiss (1976) and Urry
(1977). It focused on the relative merits of two Bayesian models (the Owen
algorithm and Samejima's Bayes modal procedure) and conventional rights-only
scoring. Gorman's first study evaluated the efficiency of the two Bayesian
models and conventional scoring on static (i.e., nonadaptive, or conventional)
tests using three measures of efficiency: (1) average bias, (2) average
accuracy, and (3) test score precision. He generated 2,000 simulated examinees
(sims) from a normal distribution (mean 0, variance 1) and 80 item scores for
each sim for both Bayesian and conventional sim group members. He then used
ANCILLES to analyze the data.
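
The data-generation step summarized above can be sketched in Python as follows.
This is a minimal sketch, not Gorman's procedure: the response model (an
assumed three-parameter logistic function with D = 1.7), the ranges from which
the hypothetical item parameters are drawn, the fixed guessing value of .20,
the random seed, and the function name are all assumptions made for
illustration. Only the sample size, the number of items, and the standard
normal ability distribution come from the text.

```python
import math
import random

D = 1.7  # assumed logistic scaling constant


def simulate_responses(n_sims=2000, n_items=80, seed=0):
    """Draw abilities from N(0, 1) and simulate one 0/1 score per sim and item."""
    rng = random.Random(seed)
    # Hypothetical item parameters (a, b, c); the study's values are not given here.
    items = [(rng.uniform(0.8, 2.0), rng.uniform(-2.0, 2.0), 0.2)
             for _ in range(n_items)]
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_sims)]
    scores = []
    for theta in thetas:
        row = []
        for a, b, c in items:
            p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
            row.append(1 if rng.random() < p else 0)  # correct with probability p
        scores.append(row)
    return thetas, items, scores


thetas, items, scores = simulate_responses()
```

The resulting 2,000 by 80 score matrix is the kind of input that an item
calibration program would then analyze.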

Gorman's first study results showed considerably less bias of estimation
for the two Bayesian procedures than for the conventional scoring at all points
on θ except at θ = -.5 to +.5, with the Owen scoring generally better than the
Bayes modal scoring at the lower θ levels and vice versa at the higher θ levels.

On his second measure of efficiency, conditional accuracy, again the
conventional scoring yielded less accurate parameter estimation than the two
Bayesian methods. Few accuracy differences between the latter two methods
emerged, although the Owen procedure did show slightly more error than the Bayes
modal model for most of the ability continuum.

Gorman's conditional test score precision measure showed substantial gains
for both latent trait scoring models over conventional scoring, with nearly
identical results between the two mathematical models. He also found
statistically significant, though relatively small (.02), gains in fidelity
coefficients in favor of the latent trait models. He concluded from his first
study that measurement improvements can be realized through the use of the
latent-trait-theory-based models to score static tests.

Gorman's second study was a follow-up of Urry's (1977) suggestion on
McBride and Weiss's (1976) study results, which documented the regression to the
mean effect using Owen's procedure. Urry suggested dividing the Bayesian
regressed ability estimate by the test reliability (the Bayesian posterior
variance squared). Gorman followed this procedure in a Monte Carlo simulation
using ANCILLES, the revision of OGIVIA3, for evaluating the efficiency of the
Bayes modal and the Owen models with the correction for regression applied.
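
The arithmetic of the correction can be illustrated with a small Python sketch.
It shows only the division step described above; how the reliability is derived
from the posterior variance is not worked out here, so the reliability is
simply passed in as a number, and both the function name and the example values
are hypothetical.

```python
def correct_for_regression(regressed_estimate, reliability):
    """Divide a regressed Bayesian ability estimate by the test reliability."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be greater than 0 and at most 1")
    return regressed_estimate / reliability


# Hypothetical example: an estimate of 1.8 from a test with reliability .90.
corrected = correct_for_regression(1.8, 0.9)
```

Because the reliability is at most 1, the division moves every estimate away
from the prior mean of zero, and estimates far from the mean move the most.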

Gorman's  study  results  showed  the  Owen  procedure  to  be  generally  preferable 
to  the  Bayes  modal  procedure  in  terms  of  conditional  bias,  conditional  accuracy, 
and  conditional  precision  when  the  correction  for  regression  was  used. 

Considering the work performed on differences between the various computer
program ability estimates, such as Bejar and Weiss (1979) showed for different
maximum likelihood and Bayesian procedures, I am glad to see studies such as
Gorman's being done. Somehow, we need to settle the arguments of the advantages
and disadvantages of the various models whereby the results of each study are
questioned by the proponents of other models. Algorithm comparisons with known
parameters are an effective way to address this research question.

As a final observation on the subject of this session, I was very pleased
to see two empirical live-data studies done. Although basic research is
important, many of our funding agencies respond more to data from real people as
opposed to simulees. I would suggest that future empirical studies include cost
data in their battery of dependent variables. There has been a dearth of these
data, and they have substantial impact on a funding agency's decisions. I
recommend that proposals for future empirical adaptive testing studies should
all include cost variables. In the competition for limited research dollars,
this information could well be the difference between obtaining funding and not;
but more importantly, the information is important for us as adaptive testing
researchers.

A Validity Study of an Adaptive Test of Reading Comprehension


Lutz  F.  Hornke  and  Michael  P.  Sauter 
University  of  Dusseldorf 


Adaptive means that a test adapts to the testee's proficiency level in the
proper "can-do" sense. A fair number of items are placed at the testee's
disposal; and solely by means of tactical rules, the testees self-select their
own individual subset of items. To achieve this, their previous responses are
used to help in making item-to-item decisions. In addition, restrictions on test
time are imposed to ensure unidimensional interpretations.

In the literature many variants of adaptive schemes are described and
discussed (see Hornke, 1976, 1977, 1979a, 1979b, 1979c; Hornke, Sauter,
Suessmilch, & Burghoff, 1979; Lord, 1971; Weiss, 1974, 1975; Weiss & Betz, 1973;
Wood, 1973). Generally speaking, the idea utilized is that of branching from
item to item or between groups of items: The item someone is branched to is
made contingent on his/her response(s) to earlier item(s). Thus, whenever a
testee answers an item correctly, he/she is presented with a more difficult one
on the assumption that his/her proficiency level at this intermediate stage is
somewhat higher than that displayed in the item just mastered. The contrary
holds for incorrect responses. The complexity and variety of branching rules is
not limited (see Hornke, 1976; Weiss, 1978). The more flexible the branching
technology, the more adaptive the decision process will be, and this yields very
reliable information about a testee's proficiency and his/her can-do potential.

The term branching technology is used here intentionally because many
adaptive testing projects already use computers. According to highly
sophisticated estimation procedures based on probabilistic mathematical response
models (see Fischer, 1978; Lord & Novick, 1969), items are deliberately
retrieved from a larger pool. These approaches use item parameters to estimate a
person's probable standing. After several cycles of item administration and
parameter estimation, a person parameter emerges that confidently reflects an
individual's proficiency level. Since items and persons are calibrated on the
same scale, by looking at those items (i.e., behaviors), the parameters of which
lie in the vicinity of the person parameter, interpretations are readily
available.

Computer terminals and micro-computers are quite costly, however, so that
paper-and-pencil versions deserve some attention. The basis of their
measurement is somewhat less stringent compared with flexible computer-assisted
tests; but when properly designed, they should allow equivalent or even better
measurement precision than conventional tests (see Hornke, 1979b; Hornke et al.,
1979). The test booklet may look the same as that for conventional tests; the
difference is that the testee is asked to use a special pen for marking his/her
answer. He/she has to pass this lightly over a bracketed field next to the
chosen answer. Chemicals then react and render visible the number of an item to
be attempted next. By following these numbers as they appear, a testee is
branched through the item set (see Hornke, 1979a; Sauter, 1978, 1979; Sauter &
Hornke, 1979). The testee is intended to be guided to just that subset of items
that tells something about his/her can-do level, while leaving out all the other
boring or otherwise frustrating items. Since a testee zig-zags through a
pyramidal item arrangement, he/she will finally end in a score category, a
self-evaluating feature of this tactical test design.

Thus, with branching tactics, flexible, fair, self-scoring, and
interpretable tests are at hand. Since any mathematical response model or
pyramidal pencil-and-paper test rests, respectively, on the quality of the items
and the model or arrangement, more successful assessment is guaranteed as long
as quality levels are maintained. Even conventional tests, however, require some
degree of item validity and reliability; otherwise, an interpretation is no
better than random guesswork. Whether and how adaptive tests will and should be
used is still an open research question.

Adaptive  Test  Designs 

Individualization is a concept that meets approval on many different sides.
To some extent, assessing an individual in his/her own right solely by what
he/she is doing seems fair. Saving time by asking nonsuperfluous questions
capitalizes more on the economy and less on the psychology of testing, though in
that area, too, something might be gained. Reduction of the stress induced by
testing, maintenance of motivation, and lack of boredom are but a few
psychological effects. So far very little is known about these side effects and
the benefits of individualized testing; these seem to be areas of potential that
await further evaluative research.

At  present,  individualized  testing  is  thought  to  have  positive  or  at  least 
non-negative  effects  on  testees.  To  understand  the  entire  range  of  adaptive 
programs  better,  three  possible  adaptive  designs  are  considered  below. 

Curtailed item sampling. This approach, a naive type which has some
intuitive appeal, resembles the examination models used in classrooms. A teacher
asks a student several questions, with content and complexity varying according
to the answers given. After a specified period of time the teacher stops and
evaluates the student. In comparing several oral examinations, considerable
variation would easily be found in the number as well as in the difficulty of
questions: This is a genuinely adaptive approach. Thus, two students may earn
the same grade but may have been asked different questions as far as number
and/or complexity was concerned. Variation in the number seems fair because
students who are asked more have a chance to demonstrate their true behavior
level; whereas with others, final evaluations are quite obvious after only a few
questions.

Computer-assisted testing. Curtailing the numbers of questions, i.e.,
restricting the sampling of items from a behavioral domain, is a reasonable
decision. For adaptive tests this would mean evaluating a testee's distance from
a set of criterion levels. Testing is stopped when, for a fixed number of items,
a testee is irrevocably located on either side of the decision point. This may
be achieved fairly soon. When there are 16 items and the criterion is set at
50%, testing should stop after 8 correct responses. However, this could occur
when all the first 8 responses are correct. A testee who made an error on the
first 2 items has to be tested for at least 8 more items, yielding a total of
10. Varying numbers of items will occur when several students are tested, one
typical aspect of flexible adaptive tests. The example of an oral examination
given above dealt with two possible adaptation criteria: (1) the number of
questions before a terminal decision can be made and (2) the quality of
questions needed to make a procedural decision. A very flexible adaptive testing
program will have to consider both criteria; this may be possible with
computer-assisted testing (see Weiss, 1975, 1978).
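
The stopping rule in the 16-item example can be written out as a short Python
sketch. It is an illustration only: the function name, the "pass", "fail", and
"undecided" labels, and the symmetric rule for stopping once the criterion can
no longer be reached are assumptions added for the example.

```python
import math


def curtailed_test(responses, n_items=16, criterion=0.5):
    """Stop as soon as the testee is irrevocably above or below the criterion.

    Returns (number of items administered, decision).
    """
    needed = math.ceil(criterion * n_items)  # 8 correct responses for 16 items at 50%
    max_errors = n_items - needed            # more errors than this cannot be recovered
    correct = errors = 0
    for administered, is_correct in enumerate(responses[:n_items], start=1):
        if is_correct:
            correct += 1
        else:
            errors += 1
        if correct >= needed:
            return administered, "pass"
        if errors > max_errors:
            return administered, "fail"
    return len(responses[:n_items]), "undecided"


all_correct = curtailed_test([True] * 16)
two_early_errors = curtailed_test([False, False] + [True] * 14)
```

The first call stops after 8 items and the second after 10, matching the two
cases worked through in the text.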

Paper-and-pencil  pyramidal  tests.  Since  large-scale  adaptive  testing  by 
means  of  computers  is  hampered  by  costs,  other  means  have  been  invented  and  used 
to  achieve  a  branching  test  system,  even  with  group  testing;  and  a  pyramidal 
test  design  for  use  with  paper-and-pencil  devices  has  emerged.  According  to  its 
feasibility  and  overall  value,  it  lies  somewhere  between  curtailed  sampling  and 
computer-assisted  testing.  By  pyramidal  is  meant  an  item  arrangement  that  is 
structured  like  a  network.  For  a  certain  population  the  item  locations  on  some 
dimensions  are  known. 

In order to design such a test, items are deliberately selected to form a
desired hierarchical item order (see Figure 1). At the top the testee gets the
starting item (Item 1), which has to be answered by each candidate. When a
correct answer is given, testees are branched to the right. Consequently, a more
difficult item has to be attempted. The contrary holds for an incorrect
response. Thus, contingent on their responses, testees are individually branched
through the item arrangement and will finally end in a test score category that
tells something about the behavioral level attained.

Figure 1

Model of a Pyramidal Item Order
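
The branching through a pyramidal item order can be sketched in Python as
follows. This is a hedged illustration rather than the authors' design: the
row-by-row item numbering (Item 1 at the top, then Items 2 and 3, and so on),
the function names, and the use of the final position as the score category are
assumptions, since the figure itself is not reproduced here.

```python
def item_number(stage, position):
    """Assumed numbering: rows are filled left to right, with Item 1 at the top."""
    return stage * (stage + 1) // 2 + position + 1


def run_pyramid(responses):
    """Branch right after a correct response and left after an incorrect one.

    Returns the list of item numbers attempted and the final score category.
    """
    position = 0  # position within the current row, counted from the left
    path = []
    for stage, is_correct in enumerate(responses):
        path.append(item_number(stage, position))
        if is_correct:
            position += 1  # move right, toward the more difficult item
    return path, position


path, category = run_pyramid([True, False, True])
```

In this sketch the final position serves directly as the score category, which
is what makes such a test self-scoring.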

When this approach is contrasted with curtailed and computer-assisted
testing, it becomes quite obvious (1) that there are available far more items
than a given testee has to attempt, (2) that testees find their individual paths
through the item network and come (or ought to come) close to the upper bounds
of their proficiency level, (3) that testing ends after a preset number of items
has been attempted, and (4) that the final item leads directly to a test score
category, i.e., no further scoring is necessary because the test is essentially
self-scoring. The dominant design feature is to adapt the quality of the items,
and not their number, to any testee.

The pyramidal test is a fixed strategy as far as item number and
arrangement are concerned, but a testee works more or less flexibly on items
that are assumed to suit his/her proficiency level more and more. The technical
problems with the pencil-and-paper format and group testing were addressed by
means of chemicals. The list of adaptive test designs here is far from complete;
many other versions have been described (e.g., see Hornke, 1976; Weiss, 1976,
1978). The report above was meant to examine closely various construction
characteristics, i.e., flexibility in item number, item difficulty, or both.

Construction  of  an  Adaptive  Pyramidal  Test 

The studies of Sauter (1978) and of Hornke et al. (1979) looked closely at
the  adaptive  test  format  and  especially  at  the  pyramidal  item  order  in  use.  It 
was  the  aim  of  both  Sauter  (1978)  and  Hornke  et  al.  (1979)  to  construct  and  to 
evaluate  such  a  test  design;  nevertheless,  the  choice  of  the  linguistic  item 
material  was  not  accidental.  The  pyramidal  item  order  requires  question  forms 
that  can  be  evaluated  objectively,  e.g.,  multiple-choice  items  or  items  with  a 
blank.  Moreover,  it  should  be  possible  to  rank  these  items  according  to  their 
empirical,  as  well  as  according  to  their  content,  difficulty,  which  should  re¬ 
flect  a  higher  level  of  linguistic  competence.  In  addition,  the  choice  of  the 
item  material  was  influenced  by  the  fact  that  it  was  not  possible,  or  necessary 
for  this  purpose,  to  construct  and  to  evaluate  new  items.  It  was  therefore  in¬ 
evitable  to  seek  proven  items  in  existing  tests. 

One  test  that  approximately  meets  the  above  prerequisites  is  the  Cologne 
Placement  Test  (see  Bonheim  &  Kreifelts,  1979),  which  is  a  traditional  placement 
test  for  students  at  the  beginning  of  their  first  semester  in  the  course  "Eng¬ 
lish  as  a  Foreign  Language."  It  consists  of  four  subtests:  Vocabulary,  Grammar 
and  Usage,  Reading  Comprehension,  and  Style  and  Verbal  Logic.  According  to  the 
needs  of  a  pyramidal  item  order,  reading  comprehension  items  seemed  to  fit  best. 

In  fact,  however,  it  is  not  very  easy  to  show  what  reading  comprehension 
questions  actually  do  test.  Definitions  are  usually  tautological:  "Reading  com¬ 
prehension  tests  the  ability  to  read  and  to  understand  a  particular  language." 
This definition, however, covers a multitude of aptitudes that have only been
described very incompletely up to now. Some language and test experts (see Harris,
1969; Heaton, 1975; Lado, 1967; Pynsent, 1972) have tried to discover a few
of  the  factors  involved  and  to  put  them  into  a  hierarchical  order  with  regard  to 
their  level  of  difficulty  and  complexity.  Obviously,  at  a  more  basic  level 
reading  comprehension  requires  the  understanding  of  the  meaning  of  words  or  word 
groups  in  the  context  in  which  they  appear  as  well  as  the  recognition  of  struc¬ 
tural  clues  and  the  comprehension  of  structural  patterns.  These  aspects  of  lan¬ 
guage  are  usually  dealt  with  in  tests  of  vocabulary  and  grammar — that  is,  the 
testee  has  to  show  his/her  ability  to  ascertain  the  verbal  meaning  of  a 
straightforward  sentence  or  phrase.  On  an  advanced  level,  reading  comprehension 
involves  higher  mental  abilities,  such  as  how  to  comprehend  paragraphs  and  to 
select  the  main  ideas,  how  to  draw  conclusions  from  the  text,  and  how  to  make 
inferences  and  to  read  between  the  lines.  The  level  of  reading  comprehension 
that  is  actually  tested  depends  to  a  certain  extent  on  the  item  type  that  is 
used.  For  example: 


Example  1 


He asked me to ______ him two thick slices of beef.

(A)  carve  (B)  slash  (C)  peel  (D)  split  (E)  shave 

(Jackson,  1976,  p.  171) 

It  is  obvious  that  this  question  form  does  not  put  too  great  a  demand  on 
the  testee's  reading  comprehension  abilities  and  can  rather  be  looked  upon  as  a 
vocabulary item. The testee has only to know that "carve" is the appropriate
word  for  meat.  He/she  can  answer  this  item  correctly  just  on  the  basis  of 
his/her  knowledge  of  vocabulary.  To  a  limited  degree  this  item  type  can  also 
test  grammatical  knowledge  by  offering  choices/words  that  all  seem  to  fit  ac¬ 
cording  to  their  meaning;  but,  in  fact,  only  one  fits  for  syntactical  reasons. 
With  this  item  type  it  is  therefore  very  difficult  to  say  to  what  extent  reading 
comprehension  is  involved  (cf.  Jackson,  1976). 

Item  types  that  do  not  lay  too  much  stress  on  the  knowledge  of  particular 
words  are  more  usual,  and  items  consisting  of  a  short  reading  extract  of  only  a 
few  sentences  that  ask  the  testee  to  interpret  it  in  some  way  seem  more  appro¬ 
priate. 


Example  2 


Parents  can  give  their  children  enormous  help  so  long  as  they  don't  talk 
too much, give the game away, or block the children's thought. "Come along,
dear, we're going to play with this lovely clay, let's see what we
can  make  with  it.  I  think  we  can  make  a  lovely  elephant,  come  along,  what 
about  the  trunk  dear..."  That  poor  child  will  have  made  a  mental  note  that 
whatever  he  takes  up  as  a  career  it  won't  be  sculpture. 


Why  is  this  child  called  "poor"? 

(a)  He  is  not  allowed  to  work  out  his  own  ideas. 

(b)  He  will  never  wish  to  become  a  sculptor. 

(c) He has begun to dislike playing with clay.

(d) He is being taught skills for which he is too young.

(Sauter  &  Hornke,  1979,  p.  165) 

Example  2  shows  clearly  that  it  tests  not  only  the  testee's  knowledge  of 
syntactical  structures  and  vocabulary  but  primarily  his/her  ability  to  interpret 
the  text  in  some  way,  for  the  correct  answer  is  not  just  a  paraphrase  of  the 
item  stem.  This  item  type  seems  to  be  capable  of  testing  what  Carroll  (1968) 
calls "complexity of information processing -- at what level of complexity can the
individual process linguistically-coded information?" (p. 53).

This  should  be  the  linguistic  dimension  that  reading  comprehension  items 
test,  at  least  in  the  adaptive  test.  In  practice,  however,  it  is  very  difficult 
to  find  items  that  represent  this  dimension  even  approximately.  Even  factor- 
analytic  studies  can  give  little  help.  Thus,  it  is  inevitable  that  what  were 
regarded  as  reading  comprehension  items  in  the  above-mentioned  sense  do,  in 
fact, correspond rather to a lower level of reading comprehension. The problem
with any test construction is that this can cause some confusion, especially in
the pyramidal item order, by branching testees to incorrect items with regard to
their own level of reading comprehension ability.

In this Cologne Placement Test (Bonheim & Kreifelts, 1979), reading comprehension
items had been administered to an average of 750 students (up to a maximum
of nearly 2,000 students) from 1974 until 1978. Since the placement test
had  been  newly  assembled  at  the  beginning  of  each  semester,  proven  as  well  as 
newly  constructed  items  were  used,  and  those  items  that  did  not  turn  out  to  be 
satisfactory  were  left  out.  The  item  pool  finally  contained  88  items  from  which 
items  were  systematically  borrowed  in  order  to  construct  the  adaptive  test. 

Each  of  the  88  items  had  been  carefully  analyzed  to  see  whether  it  could  be 
placed  at  a  certain  branching  point  within  the  pyramidal  item  order  (see  Figure 
1).  However,  with  the  present  state  of  knowledge,  these  decisions  were  not  eas¬ 
ily  made,  because  there  were  neither  guidelines  nor  previous  experience  for  item 
selection  that  could  guarantee  a  successful  branching  order.  Additional  prob¬ 
lems  that  had  to  be  solved  were  those  of  time  limits  and  the  positional  effects 
of  the  items  in  the  Cologne  Placement  Test  (for  a  detailed  description,  see 
Hornke et al., 1979; Sauter, 1978; Sauter & Hornke, 1979).

Twenty-eight  items  were  borrowed  from  the  item  pool  in  order  to  form  a  py¬ 
ramidal  test,  which  consisted  of  seven  stages  and  extended  to  a  difficulty  level 
from  P  (Probability  of  a  Correct  Response)  =  .75  to  P  =  .15.  All  items  were 
placed  on  branching  positions  according  to  their  empirical  difficulty  and  dis¬ 
crimination.  Figure  2  compares  the  ideal  item  order  with  the  actual  order  that 
is  based  on  the  available  item  data.  It  shows  only  relatively  small  deviations 
from  the  positions  on  the  ideal  model. 
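
The difficulty values reported in the text (a starting item at P = .45, second-stage items at P = .50 and P = .40, and an overall range from P = .75 to P = .15 over seven stages) are consistent with a regular grid in which each stage shifts the easiest item by .05 and neighboring items within a stage differ by .10. The following sketch generates such a grid. The regular spacing is an inference from those reported values, not a layout taken from Figure 2, and the function name is hypothetical.

```python
from typing import List


def ideal_difficulties(n_stages: int = 7,
                       start_percent: int = 45,
                       step_percent: int = 5) -> List[List[float]]:
    """Ideal proportion-correct values for each stage of the pyramid.

    Assumption (inferred, not from the source figure): moving left makes
    an item easier by one step, moving right makes it harder by one step.
    Values are computed in whole percent to avoid floating-point drift.
    """
    layout = []
    for stage in range(1, n_stages + 1):
        easiest = start_percent + step_percent * (stage - 1)
        row = [(easiest - 2 * step_percent * pos) / 100 for pos in range(stage)]
        layout.append(row)
    return layout


for stage, row in enumerate(ideal_difficulties(), start=1):
    print(stage, row)
```

Under these assumptions the seventh stage runs from .75 down to .15, and the grid contains 28 cells in total, which agrees with the number of items borrowed from the pool.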

The testee begins with a medium-difficult item (P1 = .45) and is branched
to a more difficult (P3 = .40) or an easier item (P2 = .50), depending on whether
he/she  answered  the  preceding  item  correctly  or  incorrectly  (see  Figure  2).  In 
this  way,  he/she  is  branched  through  the  item  order  until  he/she  finally  reaches 
his/her  score  group.  He/she  is  given  only  one  item  at  each  stage,  which  eventu¬ 
ally  means  that  he/she  has  to  work  on  only  7  out  of  28  items.  This  seems  to  be 
reasonable, assuming that those items that are easier than the items he/she answered
correctly are probably too easy for him/her. On the other hand, those
items that are more difficult are supposedly too difficult for him/her; he/she
would most probably answer them incorrectly (see Hornke, 1976, 1977). Thus,
only those items are presented to the testee that are most suited for him/her
using the pyramidal item order. With the test under consideration, the invisible-ink
response mode was used in a group setting.

Figure 2

Pyramidal Order of the Adaptive English Test with
the Branching Path of a Hypothetical Testee

Results  of  Two  Empirical  Investigations 

Two  adaptive  reading  comprehension  tests  were  investigated — one  in  a  pilot 
study by Sauter (1978) and the other in a larger validity study by Hornke et al.
(1979).  Both  studies  showed  that  adaptive  testing  by  means  of  the  paper-and- 
pencil  version  is  quite  feasible  in  group  settings.  Students  had  hardly  any 
problems  in  following  branching  instructions  properly  by  themselves. 

Validity  of  the  Pyramidal  Item  Order 

The  design  of  Sauter's  (1978)  study  asked  each  student  (1)  to  work  through 
an  item  set  of  28  items  in  the  branching  manner  and  (2)  to  solve  all  items  left 
out  during  the  branching  in  the  conventional  manner.  This  yielded  two  scores 
per  person — an  adaptive  score  and  a  conventional  score,  where  the  first  was 
based  on  7  items  and  the  second  on  21  residual  items.  Thus,  complete  response 
data  were  available  on  all  items.  This  allowed  the  validity  of  the  pyramidal 
item order to be investigated in some detail.

The results of an item analysis indicated that all 28 items had become easier
than in the original conventional test. However, rank orders between previous
and present item difficulties correlated as highly as r = .77, indicating
that the order as such had largely survived. Of particular interest was the
correlation between scores on the 7 adaptive items and the 21 conventional remaining
items, which was r = .47 for 93 testees. Taking the unreliability of
the entire set of items into account, however, a stepped-up correlation of r =
.64 resulted. Thus, a score based on the 7 optional items had quite a reasonable
predictive power for a score based on 21 items.
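
The text does not name the correction used to step up the correlation. One reading that reproduces the reported figure is the Spearman-Brown formula with a lengthening factor of 2; this is an assumption made here for illustration, and the function name is hypothetical.

```python
def spearman_brown(r: float, k: float = 2.0) -> float:
    """Spearman-Brown step-up of a correlation r for a lengthening factor k."""
    return k * r / (1.0 + (k - 1.0) * r)


# Observed correlation between the 7 adaptive and the 21 remaining items.
observed = 0.47
print(round(spearman_brown(observed), 2))  # 0.64
```

With r = .47 and k = 2 the formula gives 0.94 / 1.47, or about .64, which matches the stepped-up value reported above.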


Validity  of  the  Adaptive  Test 


The second study (Hornke et al., 1979) had two main purposes, namely, to
investigate the validity of an adaptive test and to look at the details of pyramidal
item hierarchies. In order to answer the first question, a multitrait
approach  was  used.  According  to  the  underlying  theory,  reading  comprehension 
items  ought  to  call  for  processes  that  are  different  from  vocabulary  or  grammar 
exercises.  Thus,  it  was  expected  that  there  would  be  a  closer  relationship  be¬ 
tween  scores  for  an  adaptive  and  a  conventional  reading  comprehension  test  than 
with scores from both grammar and vocabulary tests. The study used a two-method
(Adaptive versus Conventional) by three-trait (Reading Comprehension [RC]
× Grammar [G] × Vocabulary [V]) design. Due to financial restrictions, however,
it  was  impossible  to  investigate  adaptive  and  conventional  test  formats  with  all 
three  traits.  The  study  thus  contrasted  adaptive  versus  conventional  reading 
comprehension  only. 


It  is  quite  obvious  that  all  three  traits  should  correlate  with  each  other 
because  they  are  genuine  parts  of  language  behavior  themselves.  However,  the 
results  in  Table  1  indicate  that  despite  all  that  they  have  in  common,  the  three 
item  sets  measured  quite  differentiable  aspects  that  pertain  to  the  hypothesized 
discriminant  relation.  This  means,  too,  that  the  data  warrant  an  interpretation 
of  three  different  traits,  even  though  intercorrelations  were  not  zero  (but  they 
are  low  enough). 


However,  reading  comprehension  scores,  assessed  either  in  the  adaptive  or 
in  the  conventional  way,  did  not  converge  to  the  extent  expected.  The  resulting 
correlations  were  too  low  for  tests  designed  to  measure  the  same  trait.  The 
correlation between RC1 (adaptive) and RC1 (conventional remainder) especially
contradicted any convergent interpretation, despite the fact that both item sets
are virtual subsets of a larger one. Here, a correlation of .6 to .7 would be
more suitable to justify any convergence. It still remains an open question
whether adaptive branching of items used with reading comprehension tests introduced
a source of error or variation that accounted for the low correlations. A
comparison of RC1 (conventional remainder) with the RC2 (conventional) scores
indicates  some  dissimilarity  in  the  item  sets,  which  appear  to  be  more  different 
than  their  common  label  would  lead  one  to  expect. 


Conclusions 


Although adaptive tests are initially intriguing, there are many problems
to overcome. The major problem lies in the fact that for foreign language testing,
a properly defined construct is necessary. Consequently, all items ought
to belong to an appropriately defined behavioral domain. This is not always
easy to achieve, and there might often be a lack of expert consensus. Instead,
empirical studies are needed to substantiate any item's relation to the construct
in question.

Table 1

Correlations Between All Tests and Formats Used

                               RC1         RC1 Conventional   RC2
                               Adaptive    Remainder          Conventional
Variable                       (7 Items)   (21 Items)         (28 Items)     G        V

Convergent
  RC1 Adaptive                   -           .405               .379
  RC1 Conventional Remainder    .531          -                 .419
  RC2 Conventional              .218        (.403)               -
Discriminant
  G (Conventional)              .295         .511              (.068)               (.419)
  V (Conventional)              .355         .431              (.214)                 -

Note. Correlation coefficients in parentheses are based on group
means instead of individual data.

A  quite  substantial  problem  for  adaptive  tests  may  be  seen  in  the  necessary 
hierarchical order for a pyramidal arrangement. Any branching decision here
implies  strongly  that  the  hierarchy  is  valid  and  stable  across  samples  of  the 
population.  The  two  studies  cited  above  indicated,  however,  that  this  may  not 
be  the  case.  As  far  as  there  are  changes  in  item  difficulties  from  one  sample 
to  the  other,  this  might  not  matter  very  much  as  long  as  all  item  positions  stay 
within  the  hierarchical  order  intended.  Whenever  there  are  changes  or  shifts  in 
positions,  the  pyramid  is  invalidated  locally,  and  false  branching  occurs.  To 
circumvent  this  problem,  rigorous  item  analysis  may  help  to  keep  this  weakness 
within  limits.  It  has  to  be  questioned,  too,  whether  difficulty  indices  (i.e., 
the  proportions  of  answers  correct)  are  good  and  reasonable  criteria  for  a  hier¬ 
archical  ordering  of  items.  With  narrowly  defined  populations  and  applications, 
this  might  be  practicable.  However,  better  estimates  of  an  item's  scale  and 
hierarchical  position  are  available  and  should  be  used.  With  these  two  studies 
cited,  it  was  not  possible  to  perform  item  analyses,  since  data  were  not  avail¬ 
able  for  this  purpose. 
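
The distinction drawn above, between a uniform shift in difficulty (harmless) and a change in relative position (locally invalidating), can be made concrete with a small check. The sketch below is illustrative only: the function name, the nested-list layout, and the example difficulty values are hypothetical and do not come from either study.

```python
from typing import List, Tuple


def branching_violations(pyramid: List[List[float]]) -> List[Tuple[int, int]]:
    """Find branching points whose successors are out of order.

    pyramid[s][p] is the observed proportion correct of the item at
    stage s + 1, position p. For every item, the left successor should be
    easier (higher proportion correct) and the right successor harder.
    Returns (stage, position) pairs where that is not the case.
    """
    violations = []
    for s in range(len(pyramid) - 1):
        for p, parent in enumerate(pyramid[s]):
            easier_child = pyramid[s + 1][p]
            harder_child = pyramid[s + 1][p + 1]
            if not (harder_child < parent < easier_child):
                violations.append((s + 1, p))
    return violations


intended = [[0.45], [0.50, 0.40], [0.55, 0.45, 0.35]]
all_easier = [[0.60], [0.65, 0.55], [0.70, 0.60, 0.50]]   # uniform shift
reordered = [[0.45], [0.50, 0.40], [0.55, 0.38, 0.35]]    # one item moved

print(branching_violations(intended))    # []
print(branching_violations(all_easier))  # []
print(branching_violations(reordered))   # [(2, 1)]
```

In the third example a single third-stage item has become harder than the second-stage item that branches to it as the "easier" alternative, so testees who fail that second-stage item are misrouted even though every other position is intact.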

Taking  these  two  arguments  together,  it  follows  immediately  that  there  will 
be  hardly  any  chance  to  take  a  conventional  test,  to  rearrange  its  item  order, 
and  to  get  an  adaptive  version.  With  any  test  construction,  careful  item  writ¬ 
ing  and  analysis  is  necessary.  This  is  true  for  adaptive  as  well  as  convention¬ 
al  tests;  ad  hoc  test  construction  hardly  conforms  to  the  careful  scrutiny  that 
is  called  for.  It  should  not  be  expected  that  adaptive  or  conventional  tests 
from this source have any value in decision making at all. In foreign language
testing, only after a good deal of research and empirical investigation has been
carried out will there be adaptive tests for a variety of purposes; but, in
fact, they are essential in a program where students' proficiency is expected to
vary considerably and where decisions of some kind are to be made.


Computerized Testing in the
Federal Armed Forces


Wolfgang  Wildgrube 
German  Ministry  of  Defense 


The  Federal  Armed  Forces  (FAF)  consists  of  about  480,000  soldiers  (240,000 
of  these  are  draftees);  the  FAF  administration  comprises  170,000  civilians;  and 
in  the  FAF  Psychological  Service  there  is  a  civilian  staff  of  1,300  psycholo¬ 
gists.  Figure  1  presents  an  overview  of  the  organization  of  the  FAF  Psychologi¬ 
cal  Services.  The  center  of  activities  is  in  personnel  psychology,  with  more 
than  80%  of  the  psychologists  in  the  area  of  aptitude  diagnosis.  Figure  2  shows 
the  psychological  aptitude  testing  procedures  for  selection  and  classification 
for  both  the  FAF  and  the  FAF  administration.  Aptitude  diagnoses  are  carried  out 
for  various  purposes  for  large  samples,  such  as  for  draftees  (about  300,000  di¬ 
agnoses  per  year);  for  volunteers  (about  30,000  per  year);  for  advancement  from 
sergeant  to  an  officer  career;  and  for  selection  of  pilots,  pyrotechnists,  civil 
servants,  and  personnel  for  linguistic  services.  Aptitude  and  intelligence 
tests  are  administered  by  paper  and  pencil  to  groups  of  about  50  persons.  Spe¬ 
cial  apparatus  tests  or  other  special  procedures  and  psychological  interviews 
follow  as  necessary,  dependent  on  the  selection  process  or  on  the  individual 
result.  With  these  procedures,  the  Psychological  Service  thus  attempts  to  make 
the  best  possible  personnel  decision. 

Problems 


The  large  number  of  testing  procedures  and  the  wide  areas  of  testing  create 
numerous  problems.  Mass  testing  (about  350,000  testees  per  year)  requires  a 
large  quantity  of  material  and  manpower.  The  test  application,  scoring,  and 
decision-making  consist  of  many  routine  activities  that  require  a  great  expendi¬ 
ture  of  personnel. 

For each selection procedure, all testees in a group work through the same standard
test battery during a limited period of time. For a certain number of testees
the test is too difficult; for others, too easy. Thus, motivation decreases and
fatigue increases. Special knowledge, attitudes, or personal spheres of interest
or inclinations are not taken into consideration. Moreover, very rarely are special
procedures possible, so that in the limited time allotted only some aptitude
dimensions are assessed, and in an undifferentiated manner.

At present the mass data, collected by paper and pencil, do not permit follow-up
analyses. Statistical evaluations of the testee data are impossible, and
changes within the tests (for single items or for the norm values as a whole) are
not  analyzed.  Technical,  organizational,  and  legal  problems  (as  for  instance, 
the  security  of  tests)  are  connected  with  the  mass  testing  and  the  different 
areas  of  aptitude  diagnoses.  It  is  necessary  that  the  tests  and  the  selection 
procedures  be  modified  shortly.  Above  all,  not  only  do  the  tests — that  is,  the 
psychological  selection  procedures  within  the  scope  of  decisions  or  careers — be¬ 
come  obsolete  very  soon  but  the  patterns  for  their  solutions  (the  items  and  the 
corresponding  correct  answers)  become  known  after  a  very  short  time.  It  is  not 
possible  to  perform  a  permanent  modification  in  addition  to  the  tests  for  career 
selection  with  the  limited  capacities  available  for  such  updates. 

Requirements  for  the  Diagnostic  Process 

Cognizance  of  these  problems  of  aptitude  diagnoses  as  well  as  the  daily 
practice  in  the  FAF  provides  a  basis  for  the  following  requirements  for  future 
diagnostic  work: 

1.  Improvement  of  the  diagnoses  is  necessary;  greater  importance  should 
be  given  to  the  differential  diagnoses.  A  useful  method  should  be 
found  for  solving  the  "bandwidth  fidelity  dilemma"  so  that,  in  spite 
of  the  use  of  mass  testing,  differential  decisions  are  possible  ("the 
right  person  in  the  right  place").  This  problem  will  become  urgent 
for  the  FAF  from  about  1985  onward,  when  there  will  not  be  enough 
draftees  available  because  of  the  rapid  decline  of  the  birthrate  in 
the  late  1960s. 

2.  Paper-and-pencil  tests  alone  will  not  suffice  in  the  future;  with  the 
improvement  of  diagnoses  and  the  consideration  of  further  aspects, 
skills,  and  experiences,  it  will  be  necessary  to  include  new  testing 
procedures  and  to  test  other  psychological  dimensions.  Additionally, 
interests,  motivations,  and  personality  aspects  should  be  tested,  and 
perception  and  motor  tests  should  be  carried  out  to  make  more  perfect 
diagnoses.

3.  In  addition  to  the  test  result — the  score  or  ability  parameter — other 
data  should  be  included  in  the  diagnostic  process.  Therefore,  re¬ 
search  programs  concerning  the  testing  process  are  necessary,  includ¬ 
ing  item  solution  time  or  time  needed  for  solving  a  subtest,  so  that 
testing  protocols  (e.g.,  for  counseling)  can  be  produced. 

4.  Finally,  mass  testing  makes  it  necessary  to  develop  economy  in  the 
entire  testing  process  and  aptitude  diagnosis.  Scores  and  other  com¬ 
putations  should  be  carried  out  during  the  session,  and  results  should 
be  directly  available  at  the  end  of  testing.  With  these  procedures 
and  the  proposed  applications  of  items  and  subtests,  it  will  be  possi¬ 
ble  to  save  time  and,  moreover,  to  improve  the  diagnostic  process. 

Potential  Solutions 

Computerized testing will provide solutions to these problems in the following
three areas:

Item production. Parts of item production can be performed by computer-assisted
test construction (CATC). In a separate project, software was produced
and  implemented  for  item  production  and  for  individualization  of  tests,  modify¬ 
ing  the  tests  by  computer.  The  first  computer  tests  are  in  the  empirical  phase, 
and  extensive  results  are  expected  in  1980. 

Test  data.  For  computation  and  interpretation  of  test  data  (selection  and 
decision;  "the  diagnostic  process")  multi-faceted  aids  are  possible:  Simula¬ 
tions  are  being  used  in  the  FAF  for  computerized  decision-making,  and  possibly 
the  test  results  will  be  used  to  call  up  draftees. 

Use  of  tests,  the  presentation  of  items,  and  scoring  procedures.  In  addi¬ 
tion  to  the  presentation  and  computation  of  items  using  the  classical  concepts, 
there  is  a  special  case  of  test  application--Computerized  Adaptive  Testing 
(CAT).  Considerable  savings  and  improvements  of  the  aptitude  diagnoses  in  the 
FAF  are  expected,  especially  from  the  adaptive  methods  and  the  new  techniques  of 
CAT. 


Components  of  Computerized  Testing 

For  the  planning  stage  and  implementation  of  computerized  testing  in  the 
FAF  a  catalog  was  produced,  containing  the  most  important  components  of  comput¬ 
erized  testing  and  therewith  also  of  CAT.  These  components,  some  of  which  will 
be  empirically  investigated  by  the  FAF,  include  the  following  (the  minimum  re¬ 
quirements  are  preceded  by  an  *): 

Hardware 


The  requirement  is  defined  to  set  up  a  test  station,  for  example,  for  50 
testees  carrying  out  diagnostic  procedures  of  draftees.  Many  technical  details 
(e.g.,  conception:  connection  to  a  large-size  computer  or  stand-alone  terminal 
station  or  a  microprocessor  for  each  testee;  CPU  and  periphery,  special  screen 
and  keyboards,  and  other  facilities)  are  clarified  and  compared.  Different 
products  will  be  rated  with  regard  to  the  requirements  in  the  FAF. 

1. *Requirement for flexibility of technology (e.g., extensions, innovations)
and modular concept of hardware;

2. *Concept of the test station (connection to a large-size computer or
stand-alone computer or microprocessor for each testee); system of
minicomputers with foreground (input/output operations) and background
(e.g., computations, estimations); multi-tasking, multi-processing;

3. *Central Processing Unit/Core Memory: construction, capacity/size,
response/access/cycle time (e.g., station with 50 terminals testing
draftees); byte or word, bit per word; accuracy/precision; floating
point arithmetic (hardware or software, binary or decimal; number of
bits for parameter estimations); *real-time execution, system-response
time (processing an input immediately, without delay time);

4. *A printer for each testing station (production of testing protocols,
plots); console display; possibility of storage, capacity of disks
(magnetic disks or floppy disks); other storage on periphery; access
mode and time; *archives/output of the raw data, compatibility (making
copies to magnetic tapes, computations on an IBM large-size computer);
processing the data on line/off line (among others, for the personnel
division, using the test data in the data bases); connection to other
computers in the FAF; definition of interfaces;

5.  Equipment  for  a  testing  station  for  each  testee  (number  of  places  con¬ 
nected  to  one  processor);  *special  displays  for  presentation  and  pro¬ 
cessing  the  items;  special  keyboards  (only  digits  and  few  buttons); 
display  quality  (sharp  definition,  contrast);  graphic  with  200,000 
points,  color  equipment;  use  of  video  pictures;  periphery,  connection 
of further equipment/devices (tachistoscope, light pencil for figural
tests  or  labyrinth  items);  usage  of  other  apparatus  or  testing  addi¬ 
tional  psychological  dimensions  with  hardware  or/and  software  (e.g., 
determination  tool);  controlling  the  testing  process  by  acoustic  stim¬ 
ulus,  input  of  the  answers  using  the  terminal  keyboard;  employment  of 
an  A/D  converter,  making  digitals  using  the  physiological  data  or  fur¬ 
ther  testee  data  from  other  equipment; 

6. *Infrastructure (e.g., power, power consumption, air conditioning);
*mobility, possibility for transportation when testing draftees at
different locations.

Test  Applications/Concepts 

The  type  of  aptitude  diagnoses  to  be  taken  over  by  a  computer  needs  to  be 
specified,  for  example,  which  psychological  dimensions  should  be  tested,  which 
contents  and  methods  should  be  used  during  the  pilot  projects  (among  others,  the 
item-response time for ability estimation), and which further tasks (e.g., next
item  presentation  and  scoring)  are  possible  with  the  test  station,  for  example, 
computerized  decision-making  or  counseling  aspects. 

1. *Flexibility for using different tests or methods; flexibility for time
limits, sequence of subtests, power/speed tests, types of items, item
material; flexibility for different data, changing the input of the
test station (e.g., insertion of personal data or item-solution
times); recording further psychological dimensions (perception, motor
skills, concentration, coordination, fatigue, curves of learning,
tracking); recording of interests, motivations, personality aspects;
*possibility for different testing processes, omnibus procedures versus
criterion-referenced measurement;

2. *Application of tests using the classical concept, presentation of conventional
items by display (such as the present procedure for draftees);
*jumping to different items, similar to the paper-and-pencil
application (selection of different items by the testee, jumping forward
and backward, as in a test book); *usage of sequential strategies
based on subtests (screen and main test; indication of "critical
items" for the next subtest or for use in the interview); *processing
the tests in groups of testees but continued application of individual
tests/subsets of items;

3. Testing the pyramidal approach with the self-scoring aspect (in sensu
Hornke, 1978); *application of tests with variable branching strategies,
using different methods, different algorithms for parameter estimations,
different scoring procedures, different criteria for cutoff;
solving different methodological problems using different estimation
procedures (Bayesian, maximum likelihood, and so forth); inquiry
of CPU/execution time using different methodological approaches (different
probabilistic models, various software);

4.  Input  of  additional  criterion  data  (e.g.,  age,  date  of  final  gradua¬ 
tion  from  school),  interests,  special  knowledge;  recording  the  bio¬ 
graphic  data  (using  a  questionnaire  or  free  responses);  *immediate 
computation  of  test  data  during  the  test  process  so  that  results  are 
finished  at  the  end  of  the  session  (i.e.,  scoring,  norm  values);  in¬ 
terpretation  of  the  test  data,  computerized  diagnoses  (classification 
with  discriminant  or  cluster  analyses);  decision-making,  placement 
recommendation  for  the  draftees,  taking  into  consideration  the  differ¬ 
ent  requirements,  priorities,  or  various  criterion  data  of  the  armed 
forces;  computerized  personnel  management  (in  contact  with  the  data 
bases  for  the  military  personnel  in  the  FAF);  additional  use  of  the 
test  station  for  counseling  aspects  (e.g.,  possibilities  of  career, 
study  at  the  universities  of  the  FAF); 

5.  Possibility  for  giving  feedback,  processing  several  subtests;  noting 
time  limit  if  tests  with  time  limit  are  in  use  (rest  time  per  subtest, 
time  used  per  item);  *recording  the  item-solution  time  and  processing 
the  time  as  an  additional  ability  estimator  or  for  counseling;  produc¬ 
ing  testing  protocols  with  the  response  patterns  (method  for  solving 
the  subtest); 

6.  Possibility  of  computerized  test  construction;  computation  of  follow¬ 
up  analyses,  validity  approaches,  and  so  forth. 

Software 


The  system  and  the  assembler  programs  monitoring  the  microcomputer,  the 
possibilities  for  updating,  the  compatibility  to  an  IBM  large-size  computer  for 
follow-up  analyses,  and  the  real-time  execution  for  presentation  and  computation 
of  items  should  all  be  considered. 

1. *Requirement for a modular system of software, implementation of new
methods and testing procedures within a short time;

2. *Conversational/dialog program for processing the test sessions
(selection of items, presentation, and computation; processing the
item-solution time; possibly giving feedback); supervision of the test
station (e.g., input/output, computations, interruptions, error
handling); monitoring the test process, operating log (e.g., internal
statistics for usage of the subtests, items, error handling, CPU
time); *introduction for handling the CRT and the keyboard, processing
of examples, operating the keyboard by various types of items; *check
of the input for formal correctness (e.g., only one digit permissible
or only a digit less than 5);

3.  Requirement  for  programming  the  minicomputer  by  the  user  (e.g.,  the 

psychologist);  using  the  higher  software  languages,  such  as  FORTRAN  or 
BASIC  (interpreter  or  compiler);  installation  of  a  compiler  for  all 
stations  or  only  for  the  development  institution;  usage  of  overlay 
techniques  or  virtual  storage  concepts,  optimizing  the  core  capacity; 
expense  for  programming,  implementation  of  new  tests,  new  methods,  new 
software;  support  by  utilities;  improving  the  software  and  the  assem¬ 
bler  programs;  updating  the  system  of  the  minicomputer  (e.g.,  presen¬ 
tation  of  the  items,  data  management  to  archive  the  raw  data,  initial 
calculations);  *storage  of  the  data  for  follow-up  analyses,  transfer¬ 
ring  to  a  file  of  a  large-size  computer  (calculations  by  SPSS  or  other 
software),  development  of  software  using  a  large-size  computer  via 
teleprocessing, simulating the minicomputer (e.g., conversational
processing, compiler, assembler).
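
The formal input check named in item 2 (only one digit permissible, or only a digit less than 5) can be illustrated with a short sketch. The function name `check_item_input` and the parameter `max_digit` are hypothetical and serve only this illustration; they are not part of the FAF software described here.

```python
def check_item_input(raw, max_digit=4):
    """Return the keyed response as an int if it is formally correct, else None.

    A response is accepted only if it is exactly one decimal digit and
    that digit does not exceed max_digit (4 means "a digit less than 5").
    """
    text = raw.strip()
    # Exactly one character, and that character must be a plain decimal digit.
    if len(text) != 1 or text not in "0123456789":
        return None
    value = int(text)
    return value if value <= max_digit else None
```

A rejected entry returns `None`, so the dialog program can ask the testee to repeat the input instead of scoring it.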

Organization  and  Usage 

Checkpoints  are  the  organization  of  the  testing  session  during  the  entire 
selection  process  (with  sport  examination,  medical  check-up  by  physician,  inter¬ 
view  by  psychologist,  and  so  forth),  operating  the  test  station  and  the  single 
screen/keyboard,  handling  for  system  trouble,  and  maintenance  services. 

1.  *Requirement  for  simplicity  of  operations  (nonspecialized  operation  of 

the  testing  station);  *explanation  for  handling  the  CRT  and  keyboard 
for  the  testees  (e.g.,  input,  corrections,  skipping  forward  and  back¬ 
ward,  giving  assistance  by  a  function  HELP);  monitoring  the  test  pro¬ 
cess  using  the  classical  concept,  i.e.,  for  side-by-side  terminals, 
parallel  versions  are  presented; 

2. Handling the test station for system trouble; restarting/restoring the
system, rerunning the session, continuing with similar items (controlling
the last transfer operation, the last processed item; successful
processing of the last written operation); security of data (safe dump
of the raw data);

3.  Breakdown  time;  maintenance  services,  agreement;  spare  parts;  require¬ 
ment  for  high  readiness  of  operations; 

4.  Cost  for  purchase  or  lease,  for  maintenance  and  spare  parts,  and  for 
operation  (price-performance  ratio); 


5. Specific points of the firms (special features not described by the
requirements above).




Planning  and  Procedure 

Following the information and concept phase, in which information is
collected and redefined for application of adaptive tests using a computer and
for incorporation into the FAF, the first research programs are planned for
examination and trial of the different contents, methods, and techniques; and
the corresponding pilot projects are prepared. After checking computerized test
applications in the FAF (their methods and techniques), the following parts and
steps are designed:

1.  Application  of  tests  using  the  classical  concept;  presentation  of  con¬ 
ventional  items  by  display;  research  program  for  the  "psychology  of 
computerized  testing";  an  experiment  by  Birke  (1979)  on  the  use  of 
item-solution  time  as  an  additional  ability  estimator; 

2.  Testing  the  pyramidal  approach  with  self-scoring  (in  sensu  Hornke, 
1977,  1978);  and 

3.  Application  of  tests  with  variable  branching  strategies  using  differ¬ 
ent  methods  and  approaches. 

For these pilot projects item pools have been prepared, and larger tests/
subtests are presently in preparation. The extensive software is to be
produced in FORTRAN, using the existing TSO connection to an IBM 370/168
computer and simulating the corresponding test applications. Parallel to the
planning of content and methods is the procuring of hardware, considering the
components as previously designated in the catalog.

Since last year the FAF has had intensive contacts with the German firms
Zak and Hogrefe, which, after many years of experience with the production of
psychological-physiological tools, have offered microprocessor-based stand-alone
computers for test application and for analyses of physiological data. Both
firms are in the development phase; thus, not all offers have yet been realized
(e.g., graphic equipment for 200,000 points, light pencil, use of video-tapes).
Zak offers a modular system with 10 intelligent terminals, two floppy disk
drives, and the central processor for one station, whereas Hogrefe offers a
screen, a CPU, and a floppy disk for each testee.

The  developments  in  the  market  are  being  observed  and  checked.  Based  on 
the  requirements  of  the  FAF,  directions  and  concomitant  requests  are  being  for¬ 
mulated,  and  the  German  Ministry  of  Defense  is  providing  a  test  station  for  the 
first  pilot  project  for  computerized  test  application. 

Conclusion 


The  Psychological  Services  of  the  FAF  is  today  at  a  starting-point  of  a 
new,  rapid  development  of  the  testing  process  and  aptitude  diagnoses.  At  this 
time  there  is  neither  background  experience  nor  a  special  approach  to  computer¬ 
ized  testing  in  the  FAF.  Until  now,  problems  of  discussion  and  research  designs 
have  been  oriented  toward  the  practice  in  the  FAF,  derived  from  the  everyday 
aptitude diagnoses requirements. I am certain, however, that in the coming
years the traditional concept of testing by using paper and pencil will be
eliminated.

There are many applications of testing technology that require decisions to
be  made  as  to  whether  a  person  is  above  or  below  a  criterion  score.  Criterion- 
referenced  testing  and  its  special  case,  mastery  testing,  are  examples  of  such  a 
decision. In the criterion-referenced testing application, it would be
especially useful if decisions could be made quickly and conveniently for each
student in an individualized instruction program. The recently developed
technology of tailored/adaptive testing (Lord, 1970) has the potential to fulfill
the requirements of such a testing system. However, there is no generally accepted
procedure  for  making  classification  decisions  using  tailored  testing,  probably 
because  these  testing  techniques  are  still  relatively  new.  The  few  procedures 
that  do  exist  are  either  based  on  randomly  sampling  items  (Epstein,  1978;  Sixtl, 
1974),  which  does  not  take  advantage  of  the  power  of  tailored  testing,  or  on 
heuristic  techniques  (Weiss,  1978),  which  do  not  have  a  sound  theoretical  base. 
The  purpose  of  this  paper  is  to  present  some  decision  procedures  that  operate 
sequentially  and  can  easily  be  applied  to  tailored  testing  without  loss  of  any 
of  the  elegance  and  mathematical  sophistication  of  the  examination  procedures. 

Tailored  Testing  Procedures 


Numerous  tailored  (i.e.,  adaptive,  response  contingent,  sequential)  testing 
procedures  now  exist  in  the  research  literature,  ranging  from  simple  two-stage 
procedures (Betz & Weiss, 1973) to complex Bayesian procedures (Owen, 1969; see
Weiss, 1974, for a good review of the tailored testing procedures that were
developed prior to 1974). Although many procedures exist, for the purposes of this
paper only tailored testing procedures using item characteristic curve (ICC)
theory and maximum likelihood ability estimation will be considered. It will
also be assumed that the tests are administered to the examinees on a computer
terminal and that the items are selected to maximize the value of the
information function at the previous ability estimate. Despite the narrow definition
of  tailored  testing  used  for  this  paper,  the  results  should  generalize  to  any 
procedure  based  upon  ICC  theory. 

In  applying  the  decision  procedures  discussed  in  this  paper,  two  specific 
ICC  models  will  be  used:  the  1-  and  3-parameter  logistic  models.  Although  any 
other  ICC  model  could  just  as  easily  have  been  used,  these  models  were  selected 
because  of  their  frequent  appearance  in  the  research  literature  and  because  of 
the  existence  of  readily  available  calibration  programs  (LOGIST,  CALFIT)  and 
tailored  testing  programs  (Reckase,  1974). 
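
The item selection rule assumed above (choose the item with the largest information at the previous ability estimate) can be sketched as follows. This is a minimal illustration, not the procedure of the cited tailored testing programs: it assumes the 3-parameter logistic ICC without a scaling constant and the usual item information formula for that model, and the names `p_3pl`, `item_information`, and `select_next_item` are hypothetical. Setting a = 1 and c = 0 gives the 1-parameter case.

```python
import math

def p_3pl(theta, a, b, c):
    """3-parameter logistic probability of a correct response (no scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Item information at ability theta under the 3PL model."""
    p = p_3pl(theta, a, b, c)
    q = 1.0 - p
    return a * a * (q / p) * ((p - c) / (1.0 - c)) ** 2

def select_next_item(theta_hat, pool, administered):
    """Index of the unused item with maximum information at theta_hat.

    pool is a list of (a, b, c) tuples; administered is a set of indices.
    """
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))
```

With equal discriminations and no guessing, the rule reduces to picking the unused item whose difficulty lies closest to the current ability estimate.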


Sequential  Decision  Procedures 


A  cursory  review  of  the  statistical  literature  indicates  that  much  has  been 
written  about  sequential  estimation  and  classification  procedures.  Although 
somewhat  more  obscure  than  ANOVA  and  regression  procedures,  most  intermediate 
level  mathematical  statistics  books  include  at  least  one  chapter  on  sequential 
analysis  (for  example,  see  Brunk,  1965,  chap.  16).  In  an  ongoing  review  of  the 
extensive  literature  on  this  topic,  it  has  been  found  that  most  procedures  fall 
into one of three categories: (1) sequential probability ratio tests (SPRT;
Wald, 1947), (2) Bayesian sequential procedures (e.g., DeGroot, 1970), and (3)
curtailed single sampling plans (Dodge & Romig, 1929). Of these procedures,
only the SPRT is narrowly specified; the other two refer to families of
procedures rather than a single technique.

Although  these  statistical  procedures  are  widely  applied  for  quality  con¬ 
trol,  little  use  has  been  made  of  them  in  the  area  of  mental  testing,  probably 
because  operable  sequential  testing  procedures  did  not  exist  until  recently.  To 
date  all  references  in  the  testing  literature  to  sequential  decisions  have  used 
the SPRT (Epstein, 1978; Reckase, 1978; Sixtl, 1974). The SPRT will therefore
be described first, followed by the Bayesian procedures. Since the curtailed
sampling plans cannot readily be applied to the commonly used tailored testing
procedures, they will not be discussed in this paper.

The  Sequential  Probability  Ratio  Test 


The  sequential  probability  ratio  test  (SPRT)  was  initially  developed  by 
Wald  (1947)  as  a  quality  control  device  for  use  by  the  Armed  Forces  during  World 
War  II.  In  addition  to  Wald's  (1947)  excellent  book  on  the  subject,  this  proce¬ 
dure  has  been  clearly  described  by  Epstein  (1978).  It  will,  therefore,  be  only 
briefly  described  here  in  order  to  generalize  the  procedure  so  that  it  will  more 
directly  apply  to  tailored  testing. 

Application  to  Mastery  Decisions 


Wald  originally  developed  the  SPRT  as  a  statistical  test  to  decide  which  of 
two  simple  hypotheses  is  more  correct.  For  example,  it  might  be  interesting  to 
determine  whether  a  student  can  answer  correctly  60%  or  80%  of  the  items  in  an 
item  pool.  The  basic  philosophy  behind  the  procedure  used  to  decide  between 
these  two  alternatives  was  to  determine  the  likelihood  of  an  observed  response 
to  an  item  under  the  two  alternative  hypotheses.  If  the  likelihood  were  suffi¬ 
ciently  larger  for  one  hypothesis  than  the  other,  that  hypothesis  would  be  ac¬ 
cepted.  If  the  two  likelihoods  were  similar,  another  observation  would  be 
taken.  Wald  (1947)  has  shown  that  one  hypothesis  will  always  be  selected  over 
another  using  a  finite  set  of  items. 

To demonstrate this procedure, suppose an item is randomly selected from an
item pool and administered to a student. If a correct response were obtained,
the likelihood under H1 (80% knowledge) would be .80, and the likelihood under
H0 (60% knowledge) would be .60. To evaluate these likelihoods, Wald takes the
ratio of the two,

\[
\frac{L(x = 1 \mid H_1)}{L(x = 1 \mid H_0)} = \frac{.80}{.60} = 1.33 . \tag{1}
\]

If the ratio is sufficiently large, H1 is accepted; if it is sufficiently small,
H0 is accepted; and if it is near 1.0, another observation is taken. The values
of this ratio that are considered sufficiently large or small depend upon what
is considered acceptable for the two possible decision errors: (1) accepting H1
when H0 is true (α error) and (2) accepting H0 when H1 is true (β error).

Although  Wald  (1947)  developed  a  procedure  for  determining  the  exact  values 
of  these  decision  points,  the  procedure  is  very  complex  and  is  seldom  used. 
Instead,  good  approximations  can  be  determined  using  the  following  formulas: 


\[
\text{lower bound} = B = \frac{\beta}{1 - \alpha} \tag{2}
\]

\[
\text{upper bound} = A = \frac{1 - \beta}{\alpha} \tag{3}
\]

Thus, if the likelihood ratio is less than or equal to B, H0 is accepted with
error probability approximately β. If the likelihood ratio is greater than or
equal to A, H1 is accepted with error probability approximately α. If the ratio
is between B and A, another item should be randomly sampled and administered and
the decision rule implemented again. If α = .05 and β = .10, for example, the
decision points would be at B = .105 and A = 18. Since the likelihood ratio
(.80/.60 = 1.33) is between these two values, no decision would be made, and
another item would be selected and administered.

Since the responses to the items follow a binomial distribution in this
example, a general expression for the likelihood ratio can be developed for the
administration of n items:

\[
\frac{L(x_1, x_2, \ldots, x_n \mid H_1)}{L(x_1, x_2, \ldots, x_n \mid H_0)}
= \frac{p_1^{\sum x_i} (1 - p_1)^{n - \sum x_i}}{p_0^{\sum x_i} (1 - p_0)^{n - \sum x_i}} \tag{4}
\]

where

xi is the score on item i (0 or 1),

p1 is the proportion of items known by the student in the item pool under
H1, and

p0 is the proportion known in the item pool under H0.

As before, H1 is accepted if this ratio is greater than or equal to A, and H0
is accepted if it is less than or equal to B. Otherwise, continue administering
items.

This procedure was originally developed to test simple hypotheses, but Wald
(1947) has shown that the procedure operates in the same way for composite
hypotheses. For example, suppose it is desirable to know whether a student knew
more than some proportion, p, of the items in an item pool. In order to use
the SPRT to make this decision, a region for which it does not matter which
decision is made must first be selected around p, say, p0 < p < p1. If p0 is
close to p1, a very precise decision is required. If p0 and p1 define a wide
indifference region around p, a rather gross decision rule is all that is
needed. The SPRT is then carried out in exactly the same fashion as above, using
p0 and p1 as the values for hypotheses H0 and H1, respectively. When the decision
points A and B are computed as above, the error rates, α and β, hold for true
values of p at p0 and p1. For true values of p more extreme than p0 or p1, the
error rates are lower.
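
The decision rule for randomly sampled items can be sketched in a few lines. This is an illustration under the binomial assumption of this section, not code from the cited programs; the names `wald_bounds` and `sprt_binomial` are hypothetical. The sketch accumulates the logarithm of the likelihood ratio, which is equivalent to comparing the ratio itself with B and A and avoids very large or very small products.

```python
import math

def wald_bounds(alpha, beta):
    """Approximate SPRT decision points: lower bound B and upper bound A."""
    return beta / (1.0 - alpha), (1.0 - beta) / alpha

def sprt_binomial(responses, p0, p1, alpha=0.05, beta=0.10):
    """Apply the SPRT to 0/1 item scores taken in order of administration.

    Returns (decision, items_used). The decision is "H1", "H0", or
    "continue" if the responses run out before a bound is reached.
    """
    lower, upper = wald_bounds(alpha, beta)
    log_lower, log_upper = math.log(lower), math.log(upper)
    log_ratio = 0.0
    n = 0
    for n, x in enumerate(responses, start=1):
        if x == 1:
            log_ratio += math.log(p1 / p0)              # correct response
        else:
            log_ratio += math.log((1.0 - p1) / (1.0 - p0))  # incorrect response
        if log_ratio >= log_upper:
            return "H1", n
        if log_ratio <= log_lower:
            return "H0", n
    return "continue", n
```

With p0 = .6, p1 = .8, α = .05, and β = .10, a single correct response leaves the ratio between the bounds, an unbroken run of correct responses reaches A on the 11th item, and an unbroken run of incorrect responses reaches B on the 4th.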

Evaluating  Outcomes 


In  order  to  evaluate  the  properties  of  the  SPRT,  two  functions  have  been 
derived: the operating characteristic (OC) function and the average sample
number (ASN) function. The OC function is defined as the probability of accepting
hypothesis  H0  as  a  function  of  the  true  proportion  of  the  item  pool  known  by  the 
student.  Although  the  derivation  of  the  OC  function  is  somewhat  complex,  the 
function  can  be  approximated  by  the  following  two  formulas: 


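\[
p = \frac{1 - \left( \frac{1 - p_1}{1 - p_0} \right)^h}{\left( \frac{p_1}{p_0} \right)^h - \left( \frac{1 - p_1}{1 - p_0} \right)^h} \tag{7}
\]

and

\[
L(p) = \frac{\left( \frac{1 - \beta}{\alpha} \right)^h - 1}{\left( \frac{1 - \beta}{\alpha} \right)^h - \left( \frac{\beta}{1 - \alpha} \right)^h} . \tag{8}
\]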


These equations are used by substituting various arbitrary values of h and
solving for p and L(p). L(p), the probability of accepting H0, is then plotted
against p to describe the OC function. Figure 1 shows an OC function for α =
.05, β = .10, p0 = .6, and p1 = .8. Note that at p = p0 the height of the curve
is equal to 1 − α, and at p = p1, the height of the curve is equal to β. Note
that the OC function is only dependent upon α, β, p0, and p1. Also, the steeper
the curve, the more accurate the SPRT decision rule.
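
Tracing the OC function by varying h can be sketched as below. The sketch assumes Wald's usual approximations for the binomial case, in which both p and L(p) are written in terms of the auxiliary value h (h = 1 gives p0 and h = −1 gives p1); the function names `oc_point` and `oc_curve` are hypothetical. The value h = 0 is skipped because both expressions become 0/0 there.

```python
def oc_point(h, p0, p1, alpha, beta):
    """Return (p, L(p)) for one nonzero value of the auxiliary parameter h."""
    upper = (1.0 - beta) / alpha          # decision point A
    lower = beta / (1.0 - alpha)          # decision point B
    miss = ((1.0 - p1) / (1.0 - p0)) ** h
    hit = (p1 / p0) ** h
    p = (1.0 - miss) / (hit - miss)
    accept_h0 = (upper ** h - 1.0) / (upper ** h - lower ** h)
    return p, accept_h0

def oc_curve(p0, p1, alpha, beta, h_values):
    """Points of the OC function, sorted by the true proportion p."""
    points = [oc_point(h, p0, p1, alpha, beta) for h in h_values if h != 0]
    return sorted(points)
```

Plotting the second element of each pair against the first gives a curve of the kind shown in Figure 1.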


Figure  1 

Example  of  the  OC  and  ASN  Functions 


The ASN function is defined as the expected number of items required to
make a decision at the various values of the true proportion of known items,
E(n|p). The formula for the ASN function for the binomial case described above
is

\[
E(n \mid p) = \frac{L(p) \ln B + \left[ 1 - L(p) \right] \ln A}{p \ln \frac{p_1}{p_0} + (1 - p) \ln \frac{1 - p_1}{1 - p_0}} , \tag{9}
\]


where all of the symbols are as described above and the logarithms are to the
base e. Figure 1 also shows the ASN function for the example presented above.
Note that the ASN function is highest between the points p0 and p1 and that the
closer together the values of p0 and p1 are, the higher the curve in that
region. In general, the lower the ASN curve, the more efficient the decision
rule.
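
The ASN function can be evaluated with the same device of varying h. The sketch below is an illustration only; it assumes Wald's approximations for p and L(p) in the binomial case, and the name `asn_point` is hypothetical. Values of h very close to zero should be avoided, since the denominator of the ASN expression approaches zero there.

```python
import math

def asn_point(h, p0, p1, alpha, beta):
    """Return (p, E(n|p)) for one nonzero h, using Wald's approximations."""
    upper = (1.0 - beta) / alpha          # decision point A
    lower = beta / (1.0 - alpha)          # decision point B
    miss = ((1.0 - p1) / (1.0 - p0)) ** h
    hit = (p1 / p0) ** h
    p = (1.0 - miss) / (hit - miss)
    accept_h0 = (upper ** h - 1.0) / (upper ** h - lower ** h)
    # Expected total log likelihood ratio at the moment a decision is made.
    numerator = accept_h0 * math.log(lower) + (1.0 - accept_h0) * math.log(upper)
    # Expected log likelihood ratio contributed by a single item.
    denominator = (p * math.log(p1 / p0)
                   + (1.0 - p) * math.log((1.0 - p1) / (1.0 - p0)))
    return p, numerator / denominator
```

For α = .05, β = .10, p0 = .6, and p1 = .8, this gives roughly 19 items at p = p0 and roughly 26 at p = p1, with larger values between the two points.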


Application  to  Tailored  Testing 

Although the SPRT as defined above is a valuable procedure for decision-
making in many situations, it makes an implicit assumption that limits its
usefulness for tailored testing. The model as presented assumes that the
probability of a correct response is the same for all items in the pool. This
assumption is reasonable if items are randomly selected and p is the proportion of the
items  that  a  student  can  answer  correctly,  but  it  is  not  reasonable  if  items  are 
selected  to  maximize  information  at  an  ability  level.  Under  the  tailored  test¬ 
ing  model  assumed  by  this  paper,  the  probability  of  a  correct  response  changes 
with  each  item,  requiring  a  modification  of  the  model. 

Fortunately,  a  detailed  analysis  of  Wald's  (1947)  work  indicates  that  the 
sequential  random  sample  assumption  is  not  necessary  for  the  application  of  the 
SPRT  but  is  needed  only  for  the  derivation  of  the  OC  and  ASN  functions.  The 
SPRT  can  then  be  directly  applied  to  tailored  testing,  but  the  OC  and  ASN  func¬ 
tions  must  be  determined  in  a  different  manner.  One  approach  to  determining 
these  functions  will  be  presented  later. 
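
One way to approximate the OC and ASN functions when items are tailored is Monte Carlo simulation. The sketch below is a rough illustration under stated assumptions, not the simulation procedure of this paper: it assumes the 1-parameter logistic model, an indifference region bounded by `theta0` and `theta1` on the latent scale, known item difficulties, and selection of the unused item whose difficulty is closest to the current ability estimate. The ability update is a simple shrinking-step stand-in for maximum likelihood estimation, and all names are hypothetical.

```python
import math
import random

def p_correct(theta, b):
    """1-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def run_tailored_sprt(true_theta, pool, theta0, theta1, alpha, beta, rng):
    """Simulate one examinee; return (decision, items_used)."""
    log_upper = math.log((1.0 - beta) / alpha)
    log_lower = math.log(beta / (1.0 - alpha))
    remaining = list(pool)
    estimate = 0.5 * (theta0 + theta1)    # start in the middle of the region
    log_ratio = 0.0
    used = 0
    while remaining:
        # Under the 1PL model the most informative item is the one whose
        # difficulty is closest to the current ability estimate.
        b = min(remaining, key=lambda d: abs(d - estimate))
        remaining.remove(b)
        x = 1 if rng.random() < p_correct(true_theta, b) else 0
        used += 1
        p1, p0 = p_correct(theta1, b), p_correct(theta0, b)
        if x == 1:
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1.0 - p1) / (1.0 - p0))
        if log_ratio >= log_upper:
            return "above", used
        if log_ratio <= log_lower:
            return "below", used
        # Assumed stand-in for maximum likelihood re-estimation.
        estimate += (2.0 / used) * (x - p_correct(estimate, b))
    return "undecided", used

def simulate_oc_asn(true_theta, pool, theta0, theta1, alpha, beta,
                    n_examinees, seed=0):
    """Estimate (OC, ASN) at true_theta from n_examinees simulated tests."""
    rng = random.Random(seed)
    below = 0
    total_items = 0
    for _ in range(n_examinees):
        decision, used = run_tailored_sprt(
            true_theta, pool, theta0, theta1, alpha, beta, rng)
        if decision == "below":
            below += 1
        total_items += used
    return below / n_examinees, total_items / n_examinees
```

Repeating `simulate_oc_asn` over a grid of true ability values gives approximate OC and ASN curves; the first returned value is the proportion of simulated examinees classified below the criterion, and the second is the mean number of items used.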

To  demonstrate  the  application  of  the  SPRT  to  tailored  testing  as  defined 
by  this  paper,  suppose  that  a  tailored  test  is  being  used  to  determine  whether  a 
student  has  exceeded  the  criterion  specified  for  a  criterion-referenced  test. 
Although the method for selecting this criterion is currently not well
specified, assume that a value, θc, has been determined and that students above
this value on the latent achievement scale pass the unit, while those below θc
are given more instruction.

In order to use the SPRT, a region must be specified around θc for which it
does not matter whether a pass or a fail decision is made. If high accuracy is
desired for the decision rule, a narrow indifference region must be specified,
but more items will be required to make the decision. As the region gets wider,
the decision accuracy declines, but fewer items are required. Values of θ, θ0
and θ1, mark the boundaries of this indifference region (θ0 < θc < θ1). Once
these values have been selected, the likelihood ratio can be defined as


\[
\frac{L(x_1, \ldots, x_n \mid \theta_1)}{L(x_1, \ldots, x_n \mid \theta_0)}
= \frac{\prod_{i=1}^{n} P_i(\theta_1)^{x_i} \, Q_i(\theta_1)^{1 - x_i}}{\prod_{i=1}^{n} P_i(\theta_0)^{x_i} \, Q_i(\theta_0)^{1 - x_i}} \tag{10}
\]

where

L(x1, ..., xn|θk), k = 0, 1, is the likelihood of the student's response
string of n items administered so far;
xi is the 0, 1 score on item i;
Pi(θk) is the probability of a correct response to item i assuming
ability θk determined from the appropriate ICC model; and
Qi(θk) = 1 − Pi(θk).

If the 1-parameter logistic model is used as a basis for the tailored testing
procedure, Equation 10 becomes

\[
\frac{L(x_1, \ldots, x_n \mid \theta_1)}{L(x_1, \ldots, x_n \mid \theta_0)}
= \prod_{i=1}^{n} \frac{e^{x_i (\theta_1 - b_i)} / \left( 1 + e^{\theta_1 - b_i} \right)}{e^{x_i (\theta_0 - b_i)} / \left( 1 + e^{\theta_0 - b_i} \right)} \tag{11}
\]

where bi is the difficulty parameter for item i. Equation 11 can be simplified
to

\[
\frac{L(x_1, \ldots, x_n \mid \theta_1)}{L(x_1, \ldots, x_n \mid \theta_0)}
= e^{(\theta_1 - \theta_0) \sum x_i} \prod_{i=1}^{n} \frac{1 + e^{\theta_0 - b_i}}{1 + e^{\theta_1 - b_i}} . \tag{12}
\]

The values of this likelihood ratio can then be used to test whether the student
is above or below θc using the same method presented earlier. If the ratio is
greater than A = (1 − β)/α, the student is classified as being above θc; if it is
below B = β/(1 − α), the student is classified below the criterion; otherwise,
another item is administered. If the 3-parameter logistic model is the basis
for the tailored testing procedure, the SPRT procedure is applied in exactly the
same manner as above, except that

\[
P_i(\theta) = c_i + (1 - c_i) \frac{e^{a_i (\theta - b_i)}}{1 + e^{a_i (\theta - b_i)}} \tag{13}
\]

(with ai, bi, and ci the discrimination, difficulty, and lower-asymptote
parameters of item i) is used in Equation 10 instead of the simple logistic form.

The evaluation of the OC and ASN functions cannot be performed as easily as
for the simple binomial model due to the presence of the item parameters in the
formula for computing the probability of a correct response. Since the item
parameters for the next item to be administered are dependent on the item pool
used and on the responses to the previous items, the derivation of these
functions depends on a complex string of conditional expectations. The conditional
probabilities involved make the derivation of these functions, for all practical
purposes, impossible. Therefore, the OC and ASN functions can only be
approximated using simulation techniques, but these approximations should be adequate
for most purposes. Some OC and ASN functions for tailored tests based on the 1-
and 3-parameter logistic models will be presented later in this paper. Note,
however, that although the full OC function cannot be derived, the value of the
function is equal to 1 − α at θ0 and to β at θ1, assuming that the item
parameters are known. In reality, these two points are not known either, since in all
cases except simulations the item parameters are only estimated.

Bayesian Sequential Decision Procedure

The Bayesian decision procedure is an alternative to the SPRT for deciding
whether or not a student has exceeded the criterion, θc. Although this
procedure is much more complicated than the SPRT, it has the capability of using
additional information in making the decision. This added information may improve
the decision process.

Basic  Concepts 


Initially, it is assumed that a population of students exists such that
each student has some definable achievement level, θ. Individual achievement
levels are labeled θi. Each person is to be tested and a decision is to be made
concerning placement above or below the criterion. The decision to place above
the criterion score is labeled d1; and the decision to place below the criterion
score, d2.

In order to decide upon a decision rule using Bayesian methodology, three
pieces of information are required in advance. These are (1) a prior
distribution of θ, (2) a loss function relating the achievement levels to the decisions,
and (3) the cost of each observation. Using these three types of information, a
decision rule (technique for selecting a decision) and a stopping rule
(technique for deciding when a decision should be made) can be determined.

The  basic  concept  used  in  choosing  a  decision  rule  is  the  concept  of  risk. 
Risk  is  defined  as  the  expected  loss,  given  a  decision.  Obviously,  the  decision 
that  minimizes  the  risk  is  the  desired  one.  When  a  Bayesian  prior  is  used,  this 
minimum  risk  is  called  the  Bayes  risk. 

The  stopping  rule  used  with  the  Bayesian  sequential  decision  procedure  is 
also based upon the Bayes risk concept. If the expected risk after taking
another observation plus the cost of the observation is less than the risk before
the  observation  is  taken,  the  sampling  should  go  on.  However,  if  the  expected 
risk  plus  the  cost  of  a  new  observation  is  greater  than  the  risk  without  the 
observation,  then  sampling  should  cease.  In  some  cases,  it  is  best  not  to  take 
any  observations  at  all,  because  the  expected  risk  plus  the  cost  of  an  observa¬ 
tion  is  greater  than  the  initial  risk  of  a  guess  based  on  the  prior  distribution 
of  achievement. 
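
The stopping rule just described can be sketched as a one-step comparison. The sketch assumes a finite set of achievement levels and a single item scored 0 or 1, and it looks only one observation ahead, exactly as in the comparison above; the function names and the argument layout are hypothetical.

```python
def bayes_risk(prior, loss):
    """Smallest expected loss over the decisions.

    prior[i] is the probability of achievement level i;
    loss[d][i] is the loss of decision d when level i is true.
    """
    return min(sum(p * l for p, l in zip(prior, row)) for row in loss)

def posterior(prior, likelihood):
    """Posterior over achievement levels after one observed response."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]

def should_observe(prior, loss, p_correct, cost):
    """True if the expected risk after one more item, plus its cost,
    is less than the risk of deciding now.

    p_correct[i] is the probability of a correct response at level i.
    """
    expected_risk = 0.0
    for likelihood in (p_correct, [1.0 - p for p in p_correct]):
        p_outcome = sum(p * l for p, l in zip(prior, likelihood))
        if p_outcome > 0.0:
            expected_risk += p_outcome * bayes_risk(
                posterior(prior, likelihood), loss)
    return expected_risk + cost < bayes_risk(prior, loss)
```

With purely illustrative values (equal priors on two levels, a loss of 10 for either wrong decision, and correct-response probabilities of .2 and .8), the risk of deciding at once is 5 and the expected risk after one item is 2, so an item costing 1 is worth administering and an item costing 4 is not.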

Based  on  this  framework,  theorems  have  been  proven  showing  that  an  optimal 
procedure  exists  and  that  the  optimal  procedure  will  reach  a  decision  after  some 
finite  number  of  observations  (DeGroot,  1977).  If  the  risk  decreases  with  each 
observation,  the  procedure  is  called  a  regular  sequential  decision  procedure. 
Only  regular  procedures  will  be  considered  here,  since  it  is  assumed  that  each 
item  administered  yields  some  positive  information  rather  than  providing  some 
misinformation. 

Simplified  Example 

Although this example is not realistic, it demonstrates the basic concepts
without requiring complicated mathematical expressions. The extension of the
procedure to realistic situations is direct, but the mathematics is cumbersome.
Suppose that two types of individuals exist in the population of interest, those
with θ1 = -.8 and those with θ2 = +.8 on a latent achievement dimension. A
tailored test is to be used to classify the individuals into two groups: those
above the criterion score 0.0 and those below. Thus, two decisions are
possible: (1) classify as d1 those above the criterion and (2) classify as d2 those
below the criterion.

If  persons  with  ability  -.8  are  classified  above  the  criterion,  a  loss  of 
25  is  incurred  in  each  case.  If  they  are  classified  below  the  criterion,  there 
is  no  loss.  If  persons  with  ability  +.8  are  classified  above  the  criterion, 
there  is  no  loss,  whereas  a  loss  of  15  is  incurred  for  each  person  classified 
below  the  criterion.  This  loss  function  is  summarized  in  Table  1;  it  should  be 
noted  that  these  loss  function  values  are  totally  arbitrary. 


Table 1
Loss Function

                    Decision
Ability (θi)      d1        d2

+.8                0        15
-.8               25         0

Suppose that the prior belief that a randomly selected person has ability
+.8 is .6 and the prior belief that he/she has ability -.8 is .4. Then, the
first step in using a Bayesian sequential decision process is to determine the
risk associated with $d_1$ and $d_2$ when no observations are taken. The
expected loss (risk) if decision $d_1$ is made is

$$
\begin{aligned}
E(\text{loss} \mid d_1) &= P(\theta_1)\,\ell(d_1 \mid \theta_1) + P(\theta_2)\,\ell(d_1 \mid \theta_2) \\
                        &= .4 \times 25 + .6 \times 0 \\
                        &= 10,
\end{aligned}
\tag{14}
$$

where $P(\theta_i)$ is the prior probability of $\theta_i$ and
$\ell(d_j \mid \theta_i)$ is the loss from making decision $d_j$ when
$\theta_i$ is true. The expected loss (risk) if $d_2$ is made is

$$
\begin{aligned}
E(\text{loss} \mid d_2) &= P(\theta_1)\,\ell(d_2 \mid \theta_1) + P(\theta_2)\,\ell(d_2 \mid \theta_2) \\
                        &= .4 \times 0 + .6 \times 15 \\
                        &= 9.
\end{aligned}
\tag{15}
$$

Thus, the Bayes decision when no observation is taken is $d_2$, and the Bayes
risk is 9. The decision $d_2$ is chosen because it has the lower risk.
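The no-observation risks in Equations 14 and 15 can be checked with a few lines of Python. This is only an illustrative sketch: the names `LOSS`, `PRIOR`, `expected_loss`, and `bayes_decision` are hypothetical and do not come from the paper; the numbers are those of Table 1 and the stated prior.

```python
# Loss table from Table 1, indexed as LOSS[decision][ability].
LOSS = {
    "d1": {-0.8: 25.0, 0.8: 0.0},  # d1 = classify above the criterion
    "d2": {-0.8: 0.0, 0.8: 15.0},  # d2 = classify below the criterion
}
# Prior beliefs stated in the text.
PRIOR = {-0.8: 0.4, 0.8: 0.6}


def expected_loss(decision, belief):
    """Expected loss of one decision under a discrete belief over abilities."""
    return sum(belief[theta] * LOSS[decision][theta] for theta in belief)


def bayes_decision(belief):
    """Return the decision with the smallest expected loss and that loss."""
    risks = {d: expected_loss(d, belief) for d in LOSS}
    best = min(risks, key=risks.get)
    return best, risks[best]


prior_risks = {d: expected_loss(d, PRIOR) for d in LOSS}
decision, risk = bayes_decision(PRIOR)
print(prior_risks, decision, risk)
```

The same `bayes_decision` helper can be reused with a posterior belief in place of the prior.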


Although the proper decision has been determined for the case when no
observations have been taken, it has not been determined whether or not an
observation should be taken. To do that, the expected risk after one observation
plus cost must be compared to the Bayes risk without an observation. Determining
the expected risk after an observation requires several steps, the first of
which is determining the posterior distribution of ability after an observation.


Suppose that an item of 0.0 difficulty is administered to a person with
ability +.8 or -.8. Depending upon whether the response is correct or
incorrect, a Bayesian posterior can be determined using Bayes' theorem:

$$
P(\theta_i \mid x) = \frac{P(x \mid \theta_i)\,P(\theta_i)}{\sum_{j=1}^{2} P(x \mid \theta_j)\,P(\theta_j)}
\tag{16}
$$


If a correct response to the item is obtained, the posterior probability of a
+.8 ability is given by

$$
P(.8 \mid x = 1) = \frac{P(1 \mid .8)\,P(.8)}{P(1 \mid .8)\,P(.8) + P(1 \mid -.8)\,P(-.8)}
\tag{17}
$$


The probabilities of an ability of +.8 or -.8 were given in the prior
distribution as .6 and .4, respectively. The probability of a correct response,
given the known ability, can be determined from the appropriate ICC model. For
example, using the 1-parameter logistic model,

$$
P(1 \mid .8) = \frac{e^{(.8 - 0)}}{1 + e^{(.8 - 0)}} = .69,
\tag{18}
$$

and, by the same model, $P(1 \mid -.8) = .31$. The posterior probability of +.8
is then $P(.8 \mid 1) = .77$. Similarly, the posterior probability of -.8 is
$P(-.8 \mid 1) = .23$. The posterior probabilities of the +.8 and -.8 abilities,
given an incorrect response, can likewise be determined using Equation 16. The
posterior probabilities, given an incorrect response, are $P(.8 \mid 0) = .37$
and $P(-.8 \mid 0) = .63$.
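The posterior update of Equation 16 can be sketched in Python as follows. The function names are hypothetical, and the 1-parameter logistic curve is written without a scaling constant, as in Equation 18. Evaluated directly, the sketch reproduces $P(.8 \mid 1) \approx .77$, but it gives $P(.8 \mid 0) \approx .40$ rather than the .37 printed in the text; the sketch does not try to resolve that difference.

```python
import math


def p_correct_1pl(theta, b):
    """1-parameter logistic ICC: e^(theta - b) / (1 + e^(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def posterior(prior, b, x):
    """Bayes' theorem over a discrete set of abilities; x is 1 (correct) or 0."""
    like = {
        t: p_correct_1pl(t, b) if x == 1 else 1.0 - p_correct_1pl(t, b)
        for t in prior
    }
    norm = sum(like[t] * prior[t] for t in prior)
    return {t: like[t] * prior[t] / norm for t in prior}


PRIOR = {-0.8: 0.4, 0.8: 0.6}
post_correct = posterior(PRIOR, b=0.0, x=1)
post_incorrect = posterior(PRIOR, b=0.0, x=0)
print(post_correct, post_incorrect)
```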

The next step is to determine the risk using the posterior distributions
just computed. If a correct response is obtained, the expected loss for $d_1$ is
.23 x 25 + .77 x 0 = 5.75. The expected loss for $d_2$ is .77 x 15 + .23 x 0 =
11.55. Thus, if a correct response is obtained, the Bayes decision is $d_1$ with
a Bayes risk of 5.75. If an incorrect response is obtained, the expected loss
for $d_1$ is .63 x 25 + .37 x 0 = 15.75, while the expected loss for $d_2$ is
.37 x 15 + .63 x 0 = 5.55. Thus, after an incorrect response, $d_2$ is the Bayes
decision with a Bayes risk of 5.55.


Since it is not known in advance whether a correct or incorrect response will be
given, the expected risk must be averaged over both possible responses. To
compute this overall expected risk, the probabilities of a correct and an
incorrect response are needed. They can be obtained as follows:

$$
\begin{aligned}
P(1) &= P(1 \mid .8)\,P(.8) + P(1 \mid -.8)\,P(-.8) \\
     &= .69 \times .6 + .31 \times .4 \\
     &= .538
\end{aligned}
\tag{19}
$$

$$
P(0) = 1 - P(1) = .462 .
$$

The expected risk after a response can now be determined from

$$
\begin{aligned}
E(\text{risk} \mid \text{response}) &= E(\text{loss} \mid 1)\,P(1) + E(\text{loss} \mid 0)\,P(0) \\
                                    &= 5.75 \times .538 + 5.55 \times .462 \\
                                    &= 5.66 .
\end{aligned}
\tag{20}
$$

At this point, whether or not another observation should be taken can be
determined. If the expected loss after an observation plus cost is greater than
the risk before an observation, then administration of items should cease. If
the risk before an observation is taken is greater, then another item should be
administered. In the example given here, assume the cost of a response is 1
unit. The expected loss after a response plus cost is then 5.66 + 1 = 6.66.
Since the Bayes risk with no items administered was 9, another item should be
administered. Depending on the response to the item, decision $d_1$ or $d_2$
could be selected. After the item is administered, the appropriate posterior
becomes the new prior and the process continues as above. A flowchart of the
entire decision process is presented in Figure 2.
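The stop-or-continue comparison described above can be sketched as a one-step look-ahead. All names here are hypothetical, and the code is a minimal illustration for the two-ability example, not an operational procedure. Because it evaluates the posteriors without intermediate rounding, it yields an expected risk after one response of about 5.9 rather than the 5.66 of Equation 20; with a cost of 1 unit the conclusion (administer another item) is the same.

```python
import math

# Loss table from Table 1, indexed as LOSS[decision][ability].
LOSS = {"d1": {-0.8: 25.0, 0.8: 0.0}, "d2": {-0.8: 0.0, 0.8: 15.0}}


def p_correct(theta, b):
    """1-parameter logistic ICC without a scaling constant."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def bayes_risk(belief):
    """Smallest expected loss over the available decisions."""
    return min(sum(belief[t] * LOSS[d][t] for t in belief) for d in LOSS)


def update(belief, b, x):
    """Return the posterior belief and the marginal probability of response x."""
    like = {t: p_correct(t, b) if x else 1.0 - p_correct(t, b) for t in belief}
    marginal = sum(like[t] * belief[t] for t in belief)
    return {t: like[t] * belief[t] / marginal for t in belief}, marginal


def should_administer(belief, b, cost):
    """Continue only if expected risk after one more item plus cost is lower."""
    risk_now = bayes_risk(belief)
    risk_next = 0.0
    for x in (1, 0):
        post, p_x = update(belief, b, x)
        risk_next += p_x * bayes_risk(post)
    return risk_next + cost < risk_now, risk_now, risk_next


go_on, risk_now, risk_next = should_administer({-0.8: 0.4, 0.8: 0.6}, b=0.0, cost=1.0)
print(go_on, round(risk_now, 2), round(risk_next, 2))
```

In a full procedure the posterior returned by `update` for the observed response would replace the prior, and `should_administer` would be called again before each further item.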

Limitations 


Although there are many positive factors in the use of the Bayesian procedure,
the very information that makes the control of the testing situation more
precise also makes it difficult to implement initially. For example, specifying
reasonable loss functions on the same metric as the cost of an observation is
difficult for most educational applications. What is the cost of misclassifying
persons below the criterion score when they really should be classified above
it? Some attempts have been made by this author to specify loss functions for
tailored testing applications, but no satisfactory results have been obtained so
far.


A second difficulty in the application of this procedure is in specifying
the prior distribution of achievement for a group. This is not as serious a
problem as determining loss functions, since performance data are usually
available from previous groups. Of course, the more accurate the prior
distribution, the more accurate the decision based on the procedure.

It should be realized that the procedure presented here is a simplification
of a procedure that would be used for actual tailored testing applications.
Achievement levels are usually continuous rather than discrete, as presented
here; and the loss due to an incorrect decision is a function of the person's
distance from the criterion score rather than a constant value. The procedure
can also be modified by changing the cost of observations with increasing test
length to allow for fatigue effects. Unfortunately, the Bayesian decision
procedure as described here has not yet been implemented in conjunction with an
operational tailored testing procedure. Plans are being developed, however, to
evaluate an operational version at the Tailored Testing Research Laboratory at
the University of Missouri.

Figure 2
Flowchart of Bayesian Decision Process


Research  Design 


The purposes of this research were (1) to obtain information on how the
SPRT procedure functioned when items were not randomly sampled from the item
pool; (2) to gain experience in selecting the bounds of the indifference region,
$\theta_0$ and $\theta_1$; and (3) to obtain information on the effects of
guessing on the accuracy of classification when the 1-parameter logistic model
was used.

Tailored Testing Procedure

To determine the effects of these variables, the computation of the SPRT
was programmed into both the 1- and 3-parameter logistic tailored testing
procedures that were operational at the University of Missouri-Columbia. Since
these procedures have been described in detail previously (Koch & Reckase,
1978), they will be merely summarized here. The programs implementing both
models used a fixed stepsize method for branching through an item pool until
both a correct and an incorrect response had been given. After that point, all
ability estimates were obtained using an empirical maximum likelihood estimation
procedure. Items were selected for both models to maximize the item information
at the previous ability estimate.

To evaluate the decision-making power of the SPRT, subjects with known
ability were needed. Therefore, a simulation routine was built into the
tailored testing program in place of the responding live examinee. At the
beginning of each simulation run, the true ability of the simulated examinee was
input into the program. This value was used to determine the true probability of
a correct response to the administered items based on the model used (1- or
3-parameter logistic) and the estimated item parameters. A number was then
randomly selected from a uniform distribution in the range from 0 to 1. If the
randomly selected number was less than or equal to the probability of a correct
response, the item was scored as correct. If the randomly selected number was
greater than the probability of a correct response, the item was scored as
incorrect. This procedure continued for each item in the tailored test.
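The response-generation step just described can be sketched as follows. The function names, the default parameter values, and the seed are hypothetical; the logistic form with discrimination `a`, pseudo-guessing `c`, and scaling constant `D` is the usual 3-parameter model, which reduces to the 1-parameter model when `a = 1`, `c = 0`, and `D = 1`.

```python
import math
import random


def p_correct(theta, b, a=1.0, c=0.0, D=1.0):
    """3-parameter logistic ICC; defaults give the 1-parameter model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_response(theta, b, rng, **item):
    """Score 1 if a uniform draw is <= the true probability of success, else 0."""
    u = rng.random()  # uniform draw on [0, 1)
    return 1 if u <= p_correct(theta, b, **item) else 0


rng = random.Random(1979)  # a different seed would be used for each replication
responses = [simulate_response(0.5, b, rng) for b in (-1.0, 0.0, 1.0)]
print(responses)
```

Passing `a=0.588, c=0.12, D=1.7` as keyword arguments would generate responses under a 3-parameter item of the kind described for the simulated pool.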

Tailored tests were simulated 25 times at each true ability using different
seed numbers for the random number generator. True abilities from -3 to +3 at
.25 intervals were used for both the 1- and 3-parameter models to evaluate the
performance of the SPRT. In addition, simulations were run on a composite
procedure in which the tailored testing procedure and the probability ratio
calculations (Equation 11) were based on the 1-parameter model, but the item
responses were determined by using the 3-parameter model. This was done to
determine the effects of guessing on correct classification using the
1-parameter logistic model.

Criterion  Values 

In computing the probability ratios, three sets of limits of the indifference
regions were used: ±.3, ±.8, and ±1. A criterion of $\theta_c = 0$ was assumed
in all cases. The ratios were computed after each item was administered, and
the results were compared to an A value of 45 and a B value of .102. These were
determined based on $\alpha = .02$ and $\beta = .10$. A classification was made
the first time these limits were exceeded. If the limits were not exceeded
before 20 items had been administered (an arbitrary upper limit on test length),
ratios above 1.0 were classified as above $\theta_c$ and ratios below 1.0 were
classified as below $\theta_c$. This is called a truncated SPRT. At each true
ability used for the simulation, the proportion of the 25 administrations
classified below $\theta_c$ and the average number of items administered were
computed. Plots of these values against the true abilities approximate the OC
and ASN functions, respectively. These plots were made for each combination of
indifference region and tailored testing method, yielding nine plots of the OC
and ASN functions.
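The A and B values quoted above follow from Wald's usual approximations, A = (1 - β)/α and B = β/(1 - α). The sketch below computes them and applies a truncated SPRT to a fixed sequence of scored responses. It assumes the probability ratio is the likelihood at the upper limit $\theta_1$ divided by the likelihood at the lower limit $\theta_0$ under the 1-parameter model; Equation 11 is not reproduced in this section, so that form, the function names, and the handling of a ratio of exactly 1.0 at truncation are all assumptions.

```python
import math


def wald_bounds(alpha, beta):
    """Wald's approximate bounds: A = (1 - beta) / alpha, B = beta / (1 - alpha)."""
    return (1.0 - beta) / alpha, beta / (1.0 - alpha)


def p_correct(theta, b):
    """1-parameter logistic ICC without a scaling constant."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def truncated_sprt(responses, difficulties, theta0, theta1,
                   alpha=0.02, beta=0.10, max_items=20):
    """Return (classification, items_used) for one simulated examinee."""
    A, B = wald_bounds(alpha, beta)
    ratio, n = 1.0, 0
    for x, b in list(zip(responses, difficulties))[:max_items]:
        n += 1
        p1, p0 = p_correct(theta1, b), p_correct(theta0, b)
        ratio *= (p1 / p0) if x == 1 else ((1.0 - p1) / (1.0 - p0))
        if ratio >= A:
            return "above", n
        if ratio <= B:
            return "below", n
    # Truncation rule: a ratio of exactly 1.0 is treated as "below" (assumption).
    return ("above" if ratio > 1.0 else "below"), n


A, B = wald_bounds(0.02, 0.10)
all_right = truncated_sprt([1] * 20, [0.0] * 20, theta0=-0.8, theta1=0.8)
all_wrong = truncated_sprt([0] * 20, [0.0] * 20, theta0=-0.8, theta1=0.8)
print(round(A, 3), round(B, 3), all_right, all_wrong)
```

Repeating `truncated_sprt` over many simulated response strings at one true ability, and recording the proportion classified "below" and the mean `items_used`, gives one point each on the estimated OC and ASN curves.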

Item  Pools 


Two different item pools were used for this study. For the analyses using
just the 1-parameter or the 3-parameter model, an existing pool of 72 vocabulary
items was used. This item pool had an approximately normal distribution of
difficulty parameters. For the 1-parameter tailored test using 3-parameter
responses, an item pool with 181 items, rectangularly distributed between -3 and
+3 in difficulty, was used. These simulated items had constant discrimination
parameters of .588 (this value yields a 1.0 when multiplied by D = 1.7) and a
pseudo-guessing parameter of .12. This simulated item pool was selected over
the real vocabulary pool to have better control over the guessing parameters.
The 1-parameter procedure used only the b-values from the pool.
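A pool with the stated properties could be set up as in the sketch below. Reading "rectangularly distributed" as 181 evenly spaced difficulties is an assumption (random uniform draws would also fit the description), and the names are hypothetical. The sketch also shows why the composite condition matters: the responses follow a curve with a floor of .12, while the 1-parameter curve used for scoring falls toward zero.

```python
import math

D = 1.7          # logistic scaling constant mentioned in the text
A_PARAM = 0.588  # constant discrimination; D * A_PARAM is approximately 1.0
C_PARAM = 0.12   # constant pseudo-guessing parameter

# 181 difficulties from -3 to +3 (even spacing is an assumption).
difficulties = [-3.0 + 6.0 * k / 180 for k in range(181)]


def p_3pl(theta, b, a=A_PARAM, c=C_PARAM):
    """3-parameter logistic curve used to generate the simulated responses."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_1pl(theta, b):
    """1-parameter curve assumed by the scoring procedure (b-values only)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# For a very low ability the two curves disagree by roughly the guessing floor.
print(round(p_3pl(-3.0, 1.0), 3), round(p_1pl(-3.0, 1.0), 3))
```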

Results 


1-Parameter  Model 


Figure 3 shows the OC functions for the 1-parameter logistic model based on
the vocabulary item pool. The figure shows three graphs, one for each of the
±.3, ±.8, and ±1 indifference regions. Note that the curves are similar
regardless of the indifference region. The data indicate that in all three cases
the classification accuracy was nearly the same.

Figure 3
One-Parameter OC Functions
for Three Indifference Regions


The values of the curves at the limits of the indifference region give
further evaluative information. At the lower point the OC function should pass
through $1 - \alpha$. At the -.3 value the curve is in fact .85 when it should
be .98, showing the degrading effects of restrictive stopping rules used by the
tailored testing procedure. At the -.8 and -1 points for the corresponding
curves, the results are about as expected, being .94 and 1.00 rather than .98.

At the upper limit of the indifference region, the OC function should have
a value of .1. For the +.3 case it is in fact .5 rather than .1, again showing
the effects of truncating the procedure. At +.8 and +1 the values of the OC
function were near or better than what they should have been, based on the
theoretically expected results.

The ASN functions for the 1-parameter model are given in Figure 4. The
curves plotted correspond to the ASN functions for the ±.3, ±.8, and ±1
indifference regions. It can immediately be seen that there was a substantial
difference in the average number of items needed to reach a decision, with the
greatest number required when the indifference region was narrowest. It can
also be seen that the largest expected number of items was near the criterion
score 0.0 and that the average number dropped off at the extreme abilities. The
slight lack of symmetry in the curves is due to the fact that $\alpha$ was not
equal to $\beta$. For abilities beyond ±1, an average of only about 3 to 5 items
was needed for classification for the wider regions, but 6 to 11 items were
needed for the ±.3 indifference region. Note that the ±.3 curve approached the
arbitrary 20-item limit for the tailored tests.


Figure 4
One-Parameter ASN Functions
for Three Indifference Regions

Figure 5 shows, for comparison purposes, the theoretical curves for the ASN
and OC functions based on the ±.3 indifference region. An infinite number of
items with difficulty 0.0 was assumed for the theoretical functions, and the
tests were assumed to have no upper limit on the number of items administered.
A comparison of Figures 3 and 4 with Figure 5 shows that the OC curve for the
theoretical function is steeper at the cutting point than the simulated curves,
and that the ASN function is substantially higher. The difference in the
theoretical and simulated OC curves shows the effect of the 20-item stopping
rule and the selection of items of differing difficulty.
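For identical items with no truncation, theoretical OC and ASN curves are commonly obtained from Wald's parametric approximations for a Bernoulli SPRT. The sketch below evaluates those approximations at the two limits of the ±.3 indifference region. Whether Figure 5 was drawn from exactly these formulas is an assumption, and the function names are hypothetical.

```python
import math


def p_correct(theta, b=0.0):
    """1-parameter logistic ICC for an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def wald_oc_asn(p0, p1, alpha, beta, h):
    """One point (p, OC, ASN) on Wald's approximate curves; requires h != 0."""
    A = (1.0 - beta) / alpha
    B = beta / (1.0 - alpha)
    r1 = p1 / p0                  # ratio factor contributed by a correct response
    r0 = (1.0 - p1) / (1.0 - p0)  # ratio factor contributed by an incorrect response
    p = (1.0 - r0 ** h) / (r1 ** h - r0 ** h)  # success probability at this h
    oc = (A ** h - 1.0) / (A ** h - B ** h)    # probability of classifying below
    drift = p * math.log(r1) + (1.0 - p) * math.log(r0)
    asn = (oc * math.log(B) + (1.0 - oc) * math.log(A)) / drift
    return p, oc, asn


p0, p1 = p_correct(-0.3), p_correct(0.3)
lower = wald_oc_asn(p0, p1, alpha=0.02, beta=0.10, h=1.0)   # point at theta = -.3
upper = wald_oc_asn(p0, p1, alpha=0.02, beta=0.10, h=-1.0)  # point at theta = +.3
print(lower, upper)
```

Sweeping `h` over a range of nonzero values traces the full curves; each resulting `p` can be mapped back to ability through the logit, `math.log(p / (1 - p))`, because the item difficulty is 0.0. At both limits the approximate ASN is well above the 20-item cap used in the simulations.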


Figure  5 

Theoretical  OC  and  ASN  Functions 


3-Parameter  Model 


The results of the simulation of the 3-parameter logistic tailored test are
given in Figures 6 and 7. Figure 6 presents the OC functions for the
3-parameter model, again using the indifference regions of ±.3, ±.8, and ±1.
Notice that, as with the 1-parameter model, the OC curves are fairly similar for
the three indifference regions throughout most of the range of ability.
However, there are discrepancies for the ±1 indifference region curve near the
+1 and -1 points, indicating a decline in decision precision for that region.
At the -.3 value for the ±.3 indifference region, the value of the curve is .96,
fairly close to the .98 theoretical value. At the upper end (+.3), however, the
value is .2 instead of the .1 value that it should be. This may show the effects
of guessing on the decision process. The ±.8 and ±1 indifference regions again
yield better error probabilities than would be expected from the theory.

The ASN function for the 3-parameter model (Figure 7) also shows similar
results to those obtained from the 1-parameter model. The ±.3 indifference
region required the greatest number of items, while ±.8 and ±1.0 required about
the same number. As before, the largest number was required near the criterion
score. However, with the 3-parameter model far fewer items, on the average,
were required to make a decision than for the 1-parameter model. Of special
note is the ASN value of about 1.0 in the -1 to -3 range on the ability scale.
Decisions seem to be possible with very few items in that range.

Figure 6
Three-Parameter OC Functions
for Three Indifference Regions

Because of the guessing component of the 3-parameter logistic model, the
ASN function tended to yield more asymmetric results than the 1-parameter model.
More items were required when classifying high than when classifying low to
compensate for the nonzero probability of a correct response. Also, the ASN
curve for the ±.3 indifference region was much more peaked than its 1-parameter
counterpart. If the simulated curves for the 3-parameter model are compared to
the theoretical curves presented in Figure 5, the OC functions can be seen to
match the theoretical functions fairly closely, while the ASN functions show
that substantially fewer items were required. Over much of the ability range, as
many as 10 times more items were specified by the theoretical ASN curve when
unlimited identical items were assumed. However, it should be noted that the
theoretical curves are based on the 1-parameter model.

Effect  of  Guessing  on  the  1-Parameter  Model 


Figure 8 shows the OC functions for the 1-parameter model when the
3-parameter model was used to determine the responses. The figure shows three
graphs, one for each of the ±.3, ±.8, and ±1 indifference regions. Note that the
curves are fairly similar regardless of the indifference region but that they
are shifted substantially to the left compared to the previous OC curves. This
indicates that the probability of classifying a person below $\theta_c$ remains
substantially reduced until an ability of about -2 is reached. In other words,
it is much easier to be classified above the criterion score with this procedure
than when guessing does not enter into the decision. Instead of being at zero,
the effective criterion has been shifted down to -1.5. Clearly, the values of
the OC function at the limits of the indifference region are entirely different
from the theoretical values.

Figure 7
Three-Parameter ASN Functions
for Three Indifference Regions

The ASN functions for the three indifference regions (±.3, ±.8, and ±1) are
shown in Figure 9. The differences between these graphs and those presented in
Figure 4 are that the curves are higher (more items were required) and that the
highest point of each curve is shifted to the steepest part of the OC curve.
The relationship between the height of the ASN function and the width of the
indifference region still holds, however: as the region gets wider, the average
number of items decreases.

Summary  and  Conclusions 

The purpose of this paper has been to describe two procedures for making
binary classification decisions using tailored testing (the sequential
probability ratio test, or SPRT, and a Bayesian decision procedure) and to
present some simulation data showing the characteristics of the operation of the
SPRT for two ICC models. The first procedure described, the SPRT, was developed
by Wald for quality control work. It has not been widely applied for testing
applications because the assumption of an equal probability of a correct
response was made to facilitate the derivation of the operating characteristic
(OC) and average sample number (ASN) functions. Since this assumption can only
be met for testing applications by randomly sampling items for administration,
the procedure has not been used with tailored testing. In this paper the
probability of a correct response was allowed to vary from item to item,
although this made the derivation of the OC and ASN functions impossible.
Simulation procedures were then used to estimate these functions.

Figure 8
Composite OC Functions
for Three Indifference Regions

The SPRT procedure described is operational at the Tailored Testing Research
Laboratory of the University of Missouri-Columbia in two forms: a live tailored
testing procedure and a simulated procedure. The results of the application of
the simulation procedure to three studies were described in this paper. The
first study estimated the OC and ASN functions for a 1-parameter logistic based
tailored testing procedure in which the size of the indifference region around
the criterion score was varied. The results of the study showed that the average
number of items needed for classification was quite low when the true ability of
a simulated person was not too close to the criterion score and that the width
of the indifference region did not greatly affect the OC function. The width of
the indifference region did have a substantial effect on the ASN function. The
accuracy of classification of the simulated tailored test was not quite as good
as administering a large number of items with difficulty values equal to the
criterion score. This result was explained by the arbitrary 20-item limit
imposed on the tailored test and by the variation in the difficulty parameters
of the items administered.

Figure 9
Composite ASN Functions
for Three Indifference Regions

The second study estimated the OC and ASN functions for a 3-parameter
logistic tailored testing procedure, also varying the size of the indifference
region. The results were similar to those for the 1-parameter model, but even
fewer items were generally needed for classification. The results of these
first two studies both indicated that the SPRT could be successfully applied to
tailored testing.

The third simulation study estimated the OC and ASN functions for the
1-parameter model when guessing was allowed to enter into the responses to the
items administered. The results showed that, in effect, guessing lowered the
criterion score, making it easier to classify an examinee above the criterion
and raising the average number of items needed for classification. This spurious
shift in the criterion greatly increased the error rates in classification. The
effect was strong enough to preclude the use of the 1-parameter model for
classification decisions when guessing is a factor.

The second decision procedure described in this paper allows the use of a
greater amount of information in making a decision than the SPRT. The Bayesian
procedure includes a prior distribution of student achievement, a loss function
for incorrect decisions, and the cost of observations in the development of the
decision rule. The basic philosophy of this procedure is to administer items
until the expected loss incurred in making a decision is less than the expected
loss after the next item is administered plus the cost of administration. At
that point a decision is made that minimizes the expected loss. The Bayesian
procedure is described in detail, and a simple example is given of its use. The
Bayesian procedure is not yet operational for making decisions under tailored
testing because appropriate loss functions for educational decisions have not
been determined. However, simulation studies of the procedure will commence in
the near future.

Both of the decision procedures described in this paper show promise for
use in tailored testing. Both also require substantial research effort before
they can be applied with confidence. It is hoped that this paper will help to
stimulate that research.

A Model for Computerized Adaptive Testing Related to
Instructional Situations

The present study involved the formulation and evaluation by computer
simulation of a model for computer-based adaptive testing related to
instructional or training situations. Specifically, the model addresses tests
composed of items corresponding to hierarchically related instructional
objectives. The purpose of the endeavor was to formulate and to analyze a model
that would reduce testing time without compromising the necessary level of
accuracy in decisions regarding the mastery or nonmastery of objectives.

The adaptive testing model developed in this study combines the models of
Ferguson (1969, 1970) and Kalisch (1974a, 1974b). Ferguson's procedure employs
the Wald probability ratio test (Wald, 1947, 1973) to determine
mastery/nonmastery of hierarchically related objectives. Kalisch's procedure
employs a process that predicts item responses based upon prior examinees' data.
For the present study a combination of obtained and predicted item responses was
used with the Wald binomial probability ratio test and hierarchical
configurations of objectives to ascertain each examinee's mastery/nonmastery of
objectives.

The  Adaptive  Testing  Model 


Configuration  and  Relative  Importance  of  the  Objectives 

A hierarchical configuration of objectives, such as in Figure 1, defines
the interrelationship of the objectives to be mastered by each trainee.
Objective 5 has Objectives 2 and 3 as its immediate subordinates or
prerequisites. This means that mastery of the skill or competency represented by
Objective 5 requires that both Objectives 2 and 3 be mastered. Nonmastery of
either or both Objectives 2 and 3 implies nonmastery of Objective 5. The figure
indicates no prerequisite to Objective 2. Objective 1 is prerequisite to both
Objectives 3 and 4. The immediate prerequisites to Objective 6 are Objectives 2,
3, and 4. No prerequisites are indicated for Objective 7.

Generally, some objectives are considered more important or critical.
Other objectives may be subordinate or prerequisite to the former objectives,
those of primary concern. If mastery can be ascertained for the "objective of
primary concern," then there appears to be little, if any, need to assess
performance on the subordinate objectives. If direct assessment of performance
on all the objectives was desired, then every objective would be identified as
an objective of primary concern.

Figure 1
Hypothetical Hierarchical Configuration of Objectives
(* Indicates an "Objective of Primary Concern.")

The model assumes that mastery of an objective implies mastery of all its
immediate subordinate objectives; nonmastery of an objective implies neither
mastery nor nonmastery of the immediate subordinates. Mastery classification on
an objective of primary concern results in an assumption that all the
immediately prerequisite or subordinate objectives are mastered, unless a
subordinate is also of primary concern. Nonmastery classification on an
objective of primary concern results in testing each immediate subordinate as if
it were also an objective of primary concern.
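These branching rules can be sketched as a walk down the hierarchy. The prerequisite table below is transcribed from the description of Figure 1 in the text; which objectives are starred as "of primary concern" in the figure is not recoverable here, so the choice of Objectives 5 and 6 in the example is hypothetical, as are the function names. The `is_master` argument stands in for whatever mastery test is applied to a single objective.

```python
# Immediate prerequisites of each objective, as described for Figure 1.
PREREQS = {1: [], 2: [], 3: [1], 4: [1], 5: [2, 3], 6: [2, 3, 4], 7: []}


def classify_objectives(primary, is_master):
    """Apply the mastery/nonmastery branching rules.

    Returns a dict mapping objective -> "mastered", "nonmastered",
    or "assumed mastered" (inferred from a mastered superordinate).
    """
    status = {}
    queue = list(primary)
    must_test = set(primary)
    while queue:
        obj = queue.pop(0)
        if status.get(obj) in ("mastered", "nonmastered"):
            continue  # already tested directly
        if is_master(obj):
            status[obj] = "mastered"
            # Immediate subordinates are assumed mastered unless they must be tested.
            for sub in PREREQS[obj]:
                if sub not in must_test:
                    status.setdefault(sub, "assumed mastered")
        else:
            status[obj] = "nonmastered"
            # Each immediate subordinate is now treated as of primary concern.
            for sub in PREREQS[obj]:
                must_test.add(sub)
                queue.append(sub)
    return status


truly_mastered = {1, 2, 3, 5}  # hypothetical examinee
status = classify_objectives([5, 6], lambda obj: obj in truly_mastered)
print(status)
```

In this example Objectives 2 and 3 are first assumed mastered because Objective 5 is mastered, and are then tested directly once Objective 6 is classified as nonmastered. Objective 7 is never examined because it is neither of primary concern nor subordinate to a tested objective.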

Basing  Decisions  on  a  Data  Base 

The  decisions  made  in  the  adaptive  testing  process  are  dependent  upon  in¬ 
formation  collected  from  prior  examinees.  Although  the  existing  model  assumes 
that  each  prior  examinee  has  answered  all  the  items  for  each  objective,  it  could 
accommodate  a  data  base  consisting  of  responses  by  prior  examinees  to  overlap¬ 
ping  subsets  of  item  pools.  Decisions  such  as  selection  of  items  for  presenta¬ 
tion  and  prediction  of  correctness/incorrectness  of  item  responses  are  made  on 
the  basis  of  the  interrelation  of  item  responses  by  prior  examinees  whose  re¬ 
sponse  patterns  match  the  present  examinee's  pattern.  For  each  item  response 
obtained  from  an  examinee  using  the  adaptive  test,  a  smaller  subset  of  prior 
subjects'  data  is  used  to  make  decisions — a  subset  of  examinees'  dichotomously 
scored responses exactly like the present examinee's response pattern.

Two response-matching procedures were defined. With the first method a
vector s of dichotomously scored responses is generated for an examinee; for
each additional response collected within a test, the s vector increases in
length. The individual's s vector is matched with sets of responses in the data
base, but only data base sets with exactly the same s vector (the same pattern
of "1's" and "0's" to exactly the same questions answered by the examinee) are
considered. With the second method, not only is the s vector used, but also an
r vector of mastery/nonmastery classifications for objectives is employed. Only
data base sets with exactly the same r and s vectors are considered. With both
methods the matching procedure provides the subset of data base entries that is
used for making predictions and selecting other items for presentation.
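
The two matching procedures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function name `match_records`, and the item and objective labels are hypothetical and do not come from the study's programs.

```python
from typing import Dict, List, Optional

Scores = Dict[str, int]  # item id (or objective id) -> 1 or 0


def match_records(
    responses: List[Scores],
    s: Scores,
    classifications: Optional[List[Scores]] = None,
    r: Optional[Scores] = None,
) -> List[int]:
    """Return indices of prior examinees that match the present examinee.

    Method 1: only the item response vector s must match exactly.
    Method 2: pass classifications and r as well; both must match exactly.
    """
    if (r is None) != (classifications is None):
        raise ValueError("give both r and classifications, or neither")
    matched = []
    for idx, record in enumerate(responses):
        # Every item answered so far must have the same 0/1 score.
        if any(record.get(item) != score for item, score in s.items()):
            continue
        # Method 2 only: objective classifications must agree as well.
        if r is not None and any(
            classifications[idx].get(obj) != status for obj, status in r.items()
        ):
            continue
        matched.append(idx)
    return matched


# Hypothetical data base of three prior examinees.
prior_responses = [
    {"i1": 1, "i2": 1, "i3": 0},
    {"i1": 1, "i2": 0, "i3": 0},
    {"i1": 1, "i2": 1, "i3": 1},
]
prior_classifications = [{"O1": 1}, {"O1": 0}, {"O1": 1}]

method_1 = match_records(prior_responses, {"i1": 1, "i2": 1})
method_2 = match_records(prior_responses, {"i1": 1}, prior_classifications, {"O1": 0})
print(method_1, method_2)
```

Because the match must be exact, each additional response can only shrink the matched subset, which is why the size of that subset matters later for prediction.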

Predicting item response correctness/incorrectness. Based upon the
dichotomously scored responses to items presented to an examinee, conditional
probabilities for answering each remaining item correctly or incorrectly are
determined on the basis of response patterns in the data base matching the
examinee's. If either conditional probability exceeds prespecified levels, the
correctness/incorrectness of the examinee's expected response is assumed.
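
A minimal sketch of this prediction step follows, assuming the matched subset of the data base is already available as a list of score dictionaries. The levels of .90 are placeholders chosen for illustration; the prespecified levels used in the study are not stated here.

```python
from typing import Dict, List, Optional


def predict_response(
    matched: List[Dict[str, int]],
    item: str,
    level_correct: float = 0.90,    # hypothetical prespecified level
    level_incorrect: float = 0.90,  # hypothetical prespecified level
) -> Optional[int]:
    """Return 1 or 0 if the response can be assumed, otherwise None."""
    scores = [rec[item] for rec in matched if item in rec]
    if not scores:
        return None  # no matching prior examinee answered this item
    p_correct = sum(scores) / len(scores)
    if p_correct > level_correct:
        return 1
    if 1.0 - p_correct > level_incorrect:
        return 0
    return None  # neither probability exceeds its level; item stays open


# Twenty hypothetical matched examinees: 19 of 20 answered i4 correctly,
# while i5 was answered correctly by 9 of 20.
matched = [{"i4": 1, "i5": k % 2} for k in range(19)] + [{"i4": 0, "i5": 0}]
print(predict_response(matched, "i4"), predict_response(matched, "i5"))
```

An item for which `None` is returned remains a candidate for presentation.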

Selection  of  items  for  presentation.  Based  upon  an  examinee's  response 
pattern and the subset of data base responses matching the examinee's, items
that  are  expected  to  provide  the  most  information  about  the  objectives  of  prima¬ 
ry  concern  are  selected  for  presentation.  Two  selection  criteria  were  investi¬ 
gated  in  this  study:  item-objective  agreement  and  inter-item  agreement.  For 
each  method  a  coefficient  was  computed  for  each  item  not  presented  and  for  which 
prediction  of  correctness/incorrectness  had  not  yet  occurred.  The  item  with  the 
highest  coefficient  was  presented  to  the  examinee. 

For the item-objective method, a coefficient of agreement between item i
and the n objectives of primary concern was calculated as follows:

\[
\text{Agreement}(i;\, O_1, O_2, \ldots, O_n \mid \vec{r}, \vec{s}) =
\frac{1}{n}\Bigl[
\Bigl\{\sum_{u=1}^{n} \mathrm{Prob}(O_u = 1 \mid \vec{r}, \vec{s}, i = 1)\,
\mathrm{Prob}(i = 1 \mid \vec{r}, \vec{s})\Bigr\}
+ \Bigl\{\sum_{u=1}^{n} \mathrm{Prob}(O_u = 0 \mid \vec{r}, \vec{s}, i = 0)\,
\mathrm{Prob}(i = 0 \mid \vec{r}, \vec{s})\Bigr\}
\Bigr]
\tag{1}
\]

where

i is the item under consideration;
O_1, O_2, ..., O_n are the n objectives of concern;
i = 1 means item i is answered correctly;
i = 0 means item i is answered incorrectly;
O_u = 1 means objective u is mastered;
O_u = 0 means objective u is not mastered;
r is the vector of objective mastery/nonmastery
classifications for the examinee; and
s is the vector of the examinee's dichotomously
scored item responses.
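
The item-objective coefficient can be estimated from relative frequencies in the matched subset of the data base. The sketch below assumes each matched prior examinee is stored as a pair of dictionaries (item scores, objective classifications); that layout and all names are hypothetical, and the conditioning on r and s is assumed to have been done already by the matching step.

```python
from typing import Dict, List, Tuple

Prior = Tuple[Dict[str, int], Dict[str, int]]  # (item scores, objective classes)


def item_objective_agreement(matched: List[Prior], item: str,
                             objectives: List[str]) -> float:
    """Frequency estimate of the agreement between item and objectives."""
    right = [cls for resp, cls in matched if resp.get(item) == 1]
    wrong = [cls for resp, cls in matched if resp.get(item) == 0]
    total = len(right) + len(wrong)
    if total == 0 or not objectives:
        return 0.0
    p_right = len(right) / total   # Prob(i = 1 | r, s)
    p_wrong = len(wrong) / total   # Prob(i = 0 | r, s)
    agreement = 0.0
    for obj in objectives:
        if right:  # Prob(O_u = 1 | r, s, i = 1) * Prob(i = 1 | r, s)
            agreement += p_right * sum(c[obj] == 1 for c in right) / len(right)
        if wrong:  # Prob(O_u = 0 | r, s, i = 0) * Prob(i = 0 | r, s)
            agreement += p_wrong * sum(c[obj] == 0 for c in wrong) / len(wrong)
    return agreement / len(objectives)


# Four hypothetical matched examinees and one objective of primary concern.
matched = [
    ({"i5": 1, "i6": 0}, {"O1": 1}),
    ({"i5": 1, "i6": 1}, {"O1": 1}),
    ({"i5": 0, "i6": 1}, {"O1": 0}),
    ({"i5": 0, "i6": 0}, {"O1": 1}),
]
coefficients = {i: item_objective_agreement(matched, i, ["O1"]) for i in ("i5", "i6")}
best_item = max(coefficients, key=coefficients.get)
print(coefficients, best_item)
```

Here item i5 agrees with the objective for three of the four prior examinees and i6 for only one, so i5 would be presented next.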

For the inter-item method a coefficient of agreement between item i and the
n other items corresponding to the objectives of concern is computed according
to the following formula:

\[
\text{Agreement}(i;\, i_1, i_2, \ldots, i_n \mid \vec{r}, \vec{s}) =
\frac{1}{n}\Bigl[
\Bigl\{\sum_{j=1}^{n} \mathrm{Prob}(i_j = 1 \mid \vec{r}, \vec{s}, i = 1)
\times \mathrm{Prob}(i = 1 \mid \vec{r}, \vec{s})\Bigr\}
+ \Bigl\{\sum_{j=1}^{n} \mathrm{Prob}(i_j = 0 \mid \vec{r}, \vec{s}, i = 0)
\times \mathrm{Prob}(i = 0 \mid \vec{r}, \vec{s})\Bigr\}
\Bigr]
\tag{2}
\]

where

i_j = 1 means item i_j is answered correctly;
i_j = 0 means item i_j is answered incorrectly;
r is the objective mastery/nonmastery pattern for the
examinee; and
s is the item response pattern (correct/incorrect) for
the examinee.

Examinee response inconsistencies. "Untrue" responses by an examinee are
those responses that do not agree with the examinee's "true" response (the
examinee's response that is not arrived at by guessing and has not been
erroneously selected or created). "Untrue" responses are expected to occur in
such cases as

1.  Selecting  the  correct  answer  by  guessing,  when  in  actuality  the  examin¬ 
ee  should  have  answered  the  item  incorrectly; 

2.  Providing  an  incorrect  answer  because  of  misinterpretation  of  part  of 
the  question;  and 

3.  Pressing  an  unintended  key  on  a  terminal  keyboard. 

Item responses that are provided by an examinee, but are contrary to the
examinee's "true" response, introduce potential measurement error into any
testing process. In the adaptive test model, erroneous responses introduce error
into the item response vector s. Vector s affects predictions of other item
responses and selection of items for presentation. Generally, it is expected
that item prediction errors will affect the accuracy of the system, whereas
errors in item selection will reduce the efficiency of the system. Prediction
and selection errors may occur, since the adaptive testing process relies on
matching the examinee's s with exactly the same response vectors in the data
base. Errors introduced into s would produce a comparison between the examinee's
performance and the wrong subset of prior examinees. Even if some of the
response sets in the data base contain the same errors as those made by the
present examinee, it would be expected that for each item the majority of prior
examinees had provided responses that concur with their "true" responses. Hence,
errors introduced into the examinee's item response vector would be expected to
cause the examinee's performance to be compared with an inappropriate subset of
prior examinees.

The  adaptive  testing  model  included  an  optional  component  that  checks  for 
potentially  "untrue"  responses  by  comparing  the  examinee's  inter-item  response 
consistency  to  the  inter-item  response  consistency  demonstrated  by  all  prior 
examinees whose data are included in the data base. When this option was
selected, it was necessary that at least two items be presented for the
examinee's responses prior to making predictions or to making other item
selections based on the item response vector s. The present model requires that
a set of items
be  independently  selected  and  presented.  In  this  study  the  number  of  items  pre¬ 
sented  was  sufficient  so  that  the  probability  of  answering  all  of  them  correctly 
by  chance  alone  was  less  than  or  equal  to  .5. 

The  purpose  of  obtaining  responses  to  a  set  of  independently  selected  items 
was  to  determine  whether  the  examinee  has  demonstrated  sufficient  consistency  in 
his/her  response  pattern  to  warrant  this  pattern  serving  as  the  item  response 
vector. A coefficient of relative interrelationship R_x between item x and all
other items for which responses have been obtained was computed as follows:

\[
R_x = \frac{\sum_i G(x,i)}{\sum_i I(x,i)}
\tag{3}
\]

where

\[
G(x,i) =
\begin{cases}
1 & \text{if both responses to item } x \text{ and item } i \text{ were correct or if both responses were incorrect} \\
0 & \text{if one response was correct and the other was wrong}
\end{cases}
\]

and

\[
I(x,i) = \{\mathrm{Prob}(i = 1 \mid x = 1) \times \mathrm{Prob}(x = 1)\}
+ \{\mathrm{Prob}(i = 0 \mid x = 0) \times \mathrm{Prob}(x = 0)\}.
\]

G(x,i) was computed on the basis of the examinee's responses to item x and all
the other items presented.


R_x indicates the examinee's consistency as compared to prior examinees'
consistency. It is possible that a given examinee demonstrated greater
consistency than prior examinees, but when the examinee's consistency was less
than that for prior examinees, his/her item response pattern was considered to
contain "untrue" responses. In this study the criterion for sufficiently
consistent responses by an examinee required that for each item x, R_x ≥ .90.
If the criterion was not attained for each item, the item with the lowest R_x
value was temporarily removed from consideration as a member of the item
response vector s. Prior to making decisions based on s, the item response
vector must contain at least the required minimum number of elements (equal to
the number of items to be answered to ensure that the probability of guessing
the correct answers is less than the criterion). If s contained fewer elements,
other items must be independently selected for s. Whenever the number of
elements in s equaled or exceeded the minimum requirement, item selections and
predictions were based upon s. After the presentation of each additional item,
all items for which responses were obtained were included in the calculations of
the R_x values. Hence, although an item response may be questioned and not
included in s, a future recalculation may indicate the item response to be
consistent with the examinee's other responses. Likewise, items once contained
in s may be excluded on a future recalculation.
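
The consistency check can be illustrated with a short Python sketch. It assumes that I(x,i) is estimated from the whole data base as the proportion of prior examinees whose responses to x and i agree, which equals Prob(i = 1 | x = 1) Prob(x = 1) + Prob(i = 0 | x = 0) Prob(x = 0) when both are taken as relative frequencies. The data layout and function names are hypothetical.

```python
from typing import Dict, List

Scores = Dict[str, int]  # item id -> 1 or 0


def expected_agreement(data_base: List[Scores], x: str, i: str) -> float:
    """I(x, i): share of prior examinees answering x and i alike."""
    both = [rec for rec in data_base if x in rec and i in rec]
    if not both:
        return 0.0
    return sum(rec[x] == rec[i] for rec in both) / len(both)


def consistency(examinee: Scores, data_base: List[Scores], x: str) -> float:
    """R_x: observed agreement of x with the other items over expected."""
    others = [i for i in examinee if i != x]
    observed = sum(1 for i in others if examinee[i] == examinee[x])  # sum of G
    expected = sum(expected_agreement(data_base, x, i) for i in others)
    if expected == 0:
        return 0.0  # no basis for comparison; treated as inconsistent here
    return observed / expected


# Hypothetical data base and examinee.
data_base = [
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 1, "c": 0},
    {"a": 0, "b": 0, "c": 0},
    {"a": 1, "b": 1, "c": 1},
]
examinee = {"a": 1, "b": 1, "c": 0}

r_values = {x: consistency(examinee, data_base, x) for x in examinee}
criterion = 0.90
if any(value < criterion for value in r_values.values()):
    questioned = min(r_values, key=r_values.get)  # lowest R_x is set aside
else:
    questioned = None
print(r_values, questioned)
```

In this example the response to item c disagrees with both other responses, so c has the lowest value and is the one temporarily removed from s.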


For  an  objective  of  primary  concern,  the  dichotomously  scored  results  to 
all  its  items  for  which  correctness/incorrectness  has  been  determined  or  pre¬ 
dicted  were  used  with  the  Wald  probability  ratio  test. 

For  example,  suppose  that  for  an  objective,  responses  were  obtained  to 
three  items  and  predictions  were  made  for  six  other  item  responses.  These  nine 
responses  (correct/incorrect  for  each  item)  were  then  used  in  the  following  for¬ 
mula: 


where 

R  =  number  of  items  answered  (or  predicted  as  being 
answered)  correctly; 

N  =  number  of  items  (number  presented  plus  the  number  predicted) ; 

Cf=  the  critical  nonmastery  score  (difficulty  of  the 
objective  for  nonmasters); 

Cp=  the  critical  mastery  score  (difficulty  of  the  objective 
for  masters). 

Mastery/nonmastery classifications were determined by comparing the value
of S to ratios involving α and β (Type I and Type II errors); α is the error
associated with falsely classifying an examinee as a nonmaster, and β is the
error of falsely classifying an examinee as a master:

1. If S > log10[(1 - β)/α], the objective was not mastered.

2. If S < log10[α/(1 - β)], the objective was mastered.

3. If neither of the above conditions was true, no mastery/nonmastery
classification was possible (and additional item responses were
necessary).
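
The three-way decision can be sketched as follows. The sketch assumes that S is the base-10 log likelihood ratio of nonmastery to mastery for a binomial model, S = R log10(C_f/C_p) + (N - R) log10[(1 - C_f)/(1 - C_p)]; this form is an assumption that is consistent with the definitions of R, N, C_f, and C_p but is not taken from the formula in the text. The thresholds follow rules 1 and 2 above, and the values of C_f and C_p in the example are hypothetical.

```python
import math


def wald_statistic(R: int, N: int, c_f: float, c_p: float) -> float:
    """Assumed form of S: log10 likelihood ratio, nonmastery over mastery."""
    return (R * math.log10(c_f / c_p)
            + (N - R) * math.log10((1 - c_f) / (1 - c_p)))


def classify(R: int, N: int, c_f: float, c_p: float,
             alpha: float, beta: float) -> str:
    s = wald_statistic(R, N, c_f, c_p)
    upper = math.log10((1 - beta) / alpha)   # rule 1 threshold
    lower = math.log10(alpha / (1 - beta))   # rule 2 threshold
    if s > upper:
        return "nonmastery"
    if s < lower:
        return "mastery"
    return "undecided"  # more item responses are needed


# Hypothetical critical scores; alpha and beta as set in this study.
C_F, C_P, ALPHA, BETA = 0.50, 0.85, 0.2, 0.1

print(classify(9, 9, C_F, C_P, ALPHA, BETA))  # nine of nine correct
print(classify(3, 9, C_F, C_P, ALPHA, BETA))  # three of nine correct
print(classify(2, 3, C_F, C_P, ALPHA, BETA))  # two of three correct
```

With nine responses (for instance three obtained and six predicted), a decision is reached in the first two cases; with only three responses and one error the statistic stays between the thresholds.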

The  model  assumes  that  the  classification  of  an  objective  for  which  insuf¬ 
ficient  items  exist  for  a  mastery/nonmastery  decision  is  "indeterminate."  This 
decision  occurred  whenever  the  pool  of  available  items  was  exhausted  before  a 
mastery/nonmastery decision could be made. Such an objective is presently
treated as "unmastered," although this could be altered without affecting other
components  of  the  model.  Rather  than  assuming  the  objective  to  be  unmastered, 
the  process  could  ascertain  which  classification  zone  was  approached  by  the  ex¬ 
aminee's  proportion  of  items  answered  correctly.  Ferguson  (1969)  used  this  pro¬ 
cedure,  but  only  after  asking  for  30  item  responses  for  the  objective.  It  ap¬ 
pears  that  if  an  examinee  cannot  demonstrate  mastery  performance  within  a  real¬ 
istically  expected  number  of  items,  immediately  prescribing  remedial  instruction 
would  be  more  efficient  than  giving  a  lengthy  test  to  make  a  decision.  An  ob¬ 
jective  for  which  an  undesirably  high  proportion  of  "indeterminate"  classifica¬ 
tions  has  been  made  indicates  an  insufficient  number  of  items,  insufficient  item 
discriminations,  or  unrealistically  high  specifications  for  acceptable  misclas- 
sification  errors. 

The adaptive testing procedure terminated when either of the following
conditions occurred: (1) all objectives were classified as mastered or
unmastered; or (2) the number of prior examinee observations in the data base
upon which predictions were based was less than two. For the first condition,
the test was terminated. For the second condition, unpresented and unpredicted
items corresponding to objectives of concern were randomly presented to the
examinee. Termination of the test occurred when each objective was classified.

Eight  Versions  of  the  Adaptive  Testing  Model 

The adaptive testing model formulated for this study was applied in a 2 x 2
x 2 configuration of options. These derive from three options, each with two
conditions: (1) two methods of item selection based upon item-objective
agreement and inter-item agreement; (2) two response matching procedures based
upon only item response patterns (only s) and upon both item response and
objective classification patterns (both r and s); and (3) a dichotomous option
regarding examinee response inconsistency. Table 1 provides a delineation of the
options used for each version; the numbers used in the remainder of the report
refer to combinations of options employed.
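
As a small illustration, the eight versions can be enumerated as the Cartesian product of the three options. The dictionary keys are hypothetical labels; the ordering is chosen so that the numbering agrees with Table 1, where the item selection method alternates fastest and the inconsistency check slowest.

```python
from itertools import product

SELECTION = ["item-objective", "inter-item"]
MATCHING = ["only s", "both r and s"]
CHECK = [False, True]

# product() varies its last argument fastest, so selection alternates first.
versions = {
    number: {"selection": sel, "matching": match, "inconsistency_check": chk}
    for number, (chk, match, sel) in enumerate(
        product(CHECK, MATCHING, SELECTION), start=1
    )
}

for number, options in versions.items():
    print(number, options)
```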


Phase  I:  Monte  Carlo  Simulations 


The  purpose  of  this  phase  of  the  study  was  twofold:  (1)  to  test  for  the 
relative  accuracy  and  efficiency  of  the  eight  versions  of  the  adaptive  testing 
model and a control version and (2) to study the relation of loss to
individuals' achievement levels for the adaptive testing versions. Accuracy was
examined in terms of correct mastery/nonmastery classifications. Efficiency was
investigated in terms of the number of items presented to examinees.

The  control  version  to  which  the  adaptive  testing  versions  were  compared 
involved  the  testing  of  every  objective.  For  each  objective  a  prespecified  num¬ 
ber  of  items  was  randomly  selected  for  each  examinee.  Under  the  control  treat¬ 
ment,  examinees  generally  received  different  items  for  an  objective,  but  each 
received  the  same  number  of  items.  For  each  objective  a  randomly  selected  inte¬ 
ger  between  3  and  6,  inclusive,  was  chosen  for  the  number  of  items  to  be  presen¬ 
ted.  Mastery  of  an  objective  was  obtained  if  an  examinee  obtained  a  score  of 
N-1 or higher, where N equals the number of items presented. A score of less
than N-1 resulted in a nonmastery classification. The resulting lengths of the
tests and the mastery criteria reflected the parameters used in the Air Force
Weapons Mechanics training program at Lowry Air Force Base, Denver, Colorado.

Table 1
Options Employed in the Eight Versions
of the Adaptive Testing Model

Testing    Item Selection    Response Matching    Inconsistency
Version    Method            Procedure¹           Check

   1       Item-objective    Only s               No
   2       Inter-item        Only s               No
   3       Item-objective    Both r and s         No
   4       Inter-item        Both r and s         No
   5       Item-objective    Only s               Yes
   6       Inter-item        Only s               Yes
   7       Item-objective    Both r and s         Yes
   8       Inter-item        Both r and s         Yes

¹ s is the item response vector and r is the objective
mastery/nonmastery classification vector.

Item response generation. Item response data were generated for
hypothetical examinees who were to demonstrate some consistency in performance
across examinations. This assumes that individuals in instructional programs
demonstrate a certain consistent performance in mastering or not mastering
objectives.

For each examination by adaptive test version, two sets of examinee data
were generated: one representing past examinees' responses and the other
including responses that would be obtained from present examinees. For the
control version, only one set of data was generated for each examination. A set
of examinee responses was generated in two steps using two computer programs,
GENTAB and GENRESP. For each examinee GENTAB produced values for elements of
consistency to be demonstrated across testings. These elements were the
examinee's achievement level and risk of guessing. The values from GENTAB and
additional parameters were used to produce item responses through program
GENRESP. Parameters specified for GENRESP included the following: (1)
hierarchical configuration of the objectives; (2) objective parameters, such as
difficulty; (3) discrimination, and passing criteria; (4) proportion and type of
hierarchical errors; and (5) guessing factor for answering items correctly.

Generation of examinees' true item responses. For each objective, each
item response for an examinee was based on a probability of answering the item
correctly. The algorithm used was

\[
P(u = 1) =
\begin{cases}
d + \dfrac{\theta - \bar{\theta}}{1 - \bar{\theta}}\,(1 - d) & \text{if } \theta > \bar{\theta} \\[2ex]
d + \dfrac{\theta - \bar{\theta}}{\bar{\theta}}\,d & \text{if } \theta < \bar{\theta},
\end{cases}
\tag{6}
\]

where

P(u = 1) = the probability of answering the item correctly;
d = difficulty of the item;
θ = examinee's objective score; and
θ̄ = mean objective score of the corresponding
mastery/nonmastery group.

A random number r in the closed interval 0 to 1 was selected. If r < P(u = 1),
the examinee was assigned a correct item response; otherwise, an incorrect item
response was assigned.
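
A minimal sketch of this generation step is given below. It assumes the piecewise form P(u = 1) = d + [(θ - θ̄)/(1 - θ̄)](1 - d) above the group mean and d + [(θ - θ̄)/θ̄] d below it, with all quantities on a 0 to 1 scale and 0 < θ̄ < 1; the lower branch in particular is an assumption made for continuity at θ = θ̄. Function and parameter names are hypothetical and are not those of GENRESP.

```python
import random


def p_correct(theta: float, theta_bar: float, d: float) -> float:
    """Probability of a correct 'true' response (assumed piecewise form)."""
    if theta > theta_bar:
        return d + (theta - theta_bar) / (1 - theta_bar) * (1 - d)
    if theta < theta_bar:
        return d + (theta - theta_bar) / theta_bar * d
    return d  # at the group mean the probability equals the item difficulty


def true_response(theta: float, theta_bar: float, d: float,
                  rng: random.Random) -> int:
    """Draw r in [0, 1) and score 1 if r < P(u = 1), else 0."""
    return 1 if rng.random() < p_correct(theta, theta_bar, d) else 0


rng = random.Random(1979)  # fixed seed so a run can be repeated
print(p_correct(0.8, 0.6, 0.7))
print([true_response(0.8, 0.6, 0.7, rng) for _ in range(10)])
```

Under this form the probability rises linearly from 0 at θ = 0 through d at the group mean to 1 at θ = 1.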


Inclusion of examinee error. The factor of successful guessing was included
in GENRESP. The probability that an examinee would attempt to guess the correct
answer, given that his/her "true" response would be incorrect, was derived
by the formula

\[
P_1 = g_1 (1 - \theta d)
\tag{7}
\]

where

g_1 is the risk factor for the examinee (from GENTAB);
θ is the examinee's objective score; and
d is the item difficulty for the examinee's mastery or
nonmastery group.

A random number r_1 in the interval 0 to 1 was selected. If r_1 ≤ P_1, the
examinee would attempt to guess the correct answer. The probability of guessing
correctly was obtained from the formula

\[
P_2 = g_2 + g_2 \theta d
\tag{8}
\]

where g_2 is the guessing factor for the item (the probability of randomly
selecting the correct answer), and θ and d are the same as defined previously.
For all items, g_2 was set equal to .2, assuming five alternatives to each item.
A random number r_2 in the interval 0 to 1 was selected. If r_2 ≤ P_2, the
examinee was credited with answering the item correctly.
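
The two-stage guessing process can be sketched as follows, assuming the forms P_1 = g_1(1 - θd) for attempting a guess and P_2 = g_2 + g_2 θd for guessing correctly. The function names and the `rng` argument are hypothetical, and the sketch is not a reproduction of GENRESP.

```python
import random


def guess_attempt_probability(g1: float, theta: float, d: float) -> float:
    """P_1: chance that an examinee tries to guess a missed item."""
    return g1 * (1 - theta * d)


def guess_success_probability(g2: float, theta: float, d: float) -> float:
    """P_2: chance that an attempted guess is correct."""
    return g2 + g2 * theta * d


def observed_response(true_resp: int, theta: float, d: float, g1: float,
                      rng: random.Random, g2: float = 0.2) -> int:
    """Turn a 'true' response into the observed one by allowing guessing."""
    if true_resp == 1:
        return 1  # guessing is modeled only for 'true' incorrect responses
    if rng.random() <= guess_attempt_probability(g1, theta, d):
        if rng.random() <= guess_success_probability(g2, theta, d):
            return 1  # a lucky guess: an "untrue" correct response
    return 0


rng = random.Random(7)
sample = [observed_response(0, 0.5, 0.6, 0.5, rng) for _ in range(1000)]
print(sum(sample) / len(sample))  # share of lucky guesses among true misses
```

With θ = .5, d = .6, and g_1 = .5, the chance of a lucky guess on a missed item is .35 x .26, or about .09.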

Experimental  Design 

The design employed 90 cells composed of an element from each of the
following two dimensions (independent variables): (1) Testing Version (8
adaptive test versions and 1 control test version) and (2) Examination (10
examinations). For each testing version, data were simulated for 50
hypothetical examinees, each of whom was to take 10 examinations using only 1
testing version across the 10 examinations. Hence, there were 450 hypothetical
examinees, each taking 10 examinations.

Separate  split-plot  factorial  analyses  of  variance  were  conducted  for  each 
of  two  dependent  variables.  The  dependent  variables  were  (1)  total  loss  associ¬ 
ated  with  errors  in  mastery/nonmastery  classifications  and  (2)  total  number  of 
items  presented. 

Total  loss.  A  loss  value  is  a  positive  or  zero  number  assigned  to  an  ac¬ 
tion-outcome  combination  (Hays  &  Winkler,  1970).  A  zero  loss  value  is  assigned 
to  any  combination  that  reflects  the  best  actions  under  the  true  circumstances. 
If  an  action  is  less  desirable  than  the  best  actions,  an  error  is  associated 
with  the  action  and  is  assigned  a  positive  value  reflecting  the  level  of  error 
involved. 

The  loss  values  appearing  in  Table  2  represent  the  relative  amounts  of  loss 
attributed  to  each  mastery/nonmastery/indeterminate  decision  made,  given  the 
"true"  mastery/nonmastery  status.1  It  can  be  seen  in  Table  2  that  under  the 
known  true  situation  of  mastery,  the  best  decision  was  to  classify  performance 
on  an  objective  as  "mastery."  The  positive  numbers  for  decisions  of  "nonmastery" 
and  "indeterminable"  indicate  there  were  errors  involved  with  these  decisions — 
the  greater  error  being  associated  with  the  latter.  Total  loss  equals  the  sum 
of  the  separate  losses  incurred  for  each  objective  decision  for  an  examinee. 

Table 2
Matrix of Loss Values Provided for
Objectives of Primary and Secondary Concern

                                       True Classification
Classification Decision               Mastery    Nonmastery

Objectives of Primary Concern
   Mastery                               0           10
   Nonmastery                            5            0
   Indeterminable                        7            3

Objectives of Secondary Concern
   Mastery                               0            6
   Nonmastery                            4            0
   Indeterminable                        5            2
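
To make the total-loss computation concrete, the loss matrix of Table 2 can be stored as a lookup table and summed over an examinee's objective decisions. The data structure and function name are hypothetical; the loss values are those of Table 2.

```python
# Loss values from Table 2, keyed by (decision, true classification).
LOSS = {
    "primary": {
        ("mastery", "mastery"): 0, ("mastery", "nonmastery"): 10,
        ("nonmastery", "mastery"): 5, ("nonmastery", "nonmastery"): 0,
        ("indeterminable", "mastery"): 7, ("indeterminable", "nonmastery"): 3,
    },
    "secondary": {
        ("mastery", "mastery"): 0, ("mastery", "nonmastery"): 6,
        ("nonmastery", "mastery"): 4, ("nonmastery", "nonmastery"): 0,
        ("indeterminable", "mastery"): 5, ("indeterminable", "nonmastery"): 2,
    },
}


def total_loss(decisions):
    """Sum the losses over (concern, decision, truth) triples, one per objective."""
    return sum(LOSS[concern][(decision, truth)]
               for concern, decision, truth in decisions)


# A hypothetical examinee with three classified objectives.
examinee_decisions = [
    ("primary", "mastery", "mastery"),           # correct decision, loss 0
    ("primary", "indeterminable", "mastery"),    # loss 7
    ("secondary", "mastery", "nonmastery"),      # false mastery, loss 6
]
print(total_loss(examinee_decisions))
```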

Total number of items presented. Items for the adaptive tests were
presented to provide information for predicting correctness/incorrectness of
other items. The total number of items presented refers to the number of items
answered by an examinee in order to make mastery/nonmastery decisions on
objectives.

1 Roger Pennell of the Air Force Human Resources Laboratory at Lowry Air Force
Base provided losses based upon values independently obtained from individuals
knowledgeable of the Air Force Weapons Mechanics training program.



Experimental model. The split-plot factorial model used was

\[
X_{ijkm} = \mu + A_i + B_j + \pi_{k(i)} + AB_{ij} + B\pi_{jk(i)} + \varepsilon_{m(ijk)}
\tag{9}
\]

where

X_{ijkm} is the dependent variable;
A_i is the testing version;
B_j is the examination; and
π_{k(i)} is the subject effect.

A posteriori tests. With regard to the testing version effect, Dunnett's
t statistic was computed for each adaptive testing version against the control
treatment. This a posteriori test was used for each dependent variable,
regardless of the F value obtained using the analysis of variance (Winer, 1971,
p. 201). Therefore, each version was compared with the control treatment. For
other effects, Newman-Keuls tests were performed only when significant F values
(α = .05) were obtained from the analyses of variance.

Sample  size.  Each  data  base  from  which  predictions  were  made  was  composed 
of  300  sets  of  responses.  For  each  of  the  90  testing  versions  by  examination 
cells,  50  hypothetical  examinees  were  used. 

α and β levels. In this phase of the study, the values of α and β relative
to the Wald procedure were set at .2 and .1, respectively.

Results 


All  of  the  adaptive  testing  versions  were  significantly  more  efficient  than 
the control version. Only one adaptive testing version demonstrated
significantly smaller losses than the control version. An analysis of variance
indicated significant Examination and Testing Version x Examination effects
(α = .05). A quasi-F statistic was computed for the testing version, since the
mixed-effects  model  did  not  directly  provide  a  mean  sums-of-squares  estimate  for 
the  required  denominator  (Winer,  1971,  pp.  375-378).  Table  3  shows  the  results 
of  the  analysis  of  variance,  and  Table  4  provides  the  descriptive  statistics  for 
each  testing  version. 


The  use  of  Hartley's  test  for  homogeneity  of  variance  (Winer,  1971, 
pp.  207-208)  resulted  in  a  rejection  of  the  equal  variance  assumption.  Hence,  a 
more  conservative  test  proposed  by  Box  (Winer,  1971,  p.  206)  was  used.  The  de¬ 
grees  of  freedom  corresponding  to  each  numerator  were  reduced  to  one.  The  test 
effect  remained  significant  at  the  .05  level,  but  the  Treatment  x  Test  interac¬ 
tion  did  not. 

Table 3
Analysis of Variance for Total Loss

Source                                      df    Mean Square        F

Between Subjects                           449
   Testing version                           8       623.32        1.14
   Subjects-within-groups                  441       525.15
   Estimates for quasi-F calculations      457       544.624
Within Subjects                           4050
   Examination                               9       754.74       36.61*
   Testing version x examination            72        40.09        1.94**
   Examination x subjects-within-groups   3969        20.62

*p < .01.
**p < .01 for df(72, 3969); p < .25 for df(1, 3969).

Dunnett's test indicated that the only adaptive testing version
significantly different (α = .05) from the control test was Adaptive Testing
Version 6, which used inter-item agreement, was based only on the item response
vector, and employed the inconsistency check. Although the obtained t value of
Adaptive Testing Version 7 did not exceed the critical value, the difference
between the two values was extremely small. The losses obtained for both
versions were extremely close. Adaptive Testing Version 7 used item-objective
agreement, was based on both item response and objective classification
vectors, and employed the inconsistency check.


Table 4
Descriptive Statistics of Total Loss for
Each Examinee per Testing Version

Testing                             Range
Version       Mean       SD       Min    Max

   1          5.84     10.25       0      52
   2          5.52      9.56       0      48
   3          5.88      9.79       0      52
   4          5.39      9.25       0      60
   5          5.03      8.46       0      46
   6          4.73      8.63       0      49
   7          4.84      8.55       0      50
   8          5.05      8.40       0      44
Control       8.40      8.45       0      60


The Newman-Keuls test indicated no pattern of significantly different
losses among the examinations. Although significant differences did occur
between some pairs of examinations, no trend was indicated. The Testing Version
x Examination interaction was not significant using the conservative F test.
There was a tendency for all versions of the model to obtain approximately the
same losses for each examination and to have losses less than the conventional
test, except for the third examination.

For the number of items presented, an analysis of variance indicated
significant Testing Version, Examination, and Testing Version x Examination
effects (α = .05). As with the other dependent variable, a quasi-F statistic was
calculated for the testing version effect. All the effects were also significant
(α = .05) for the more conservative F test, used because of the heterogeneous
variances. Table 5 shows the results of the analysis, and Table 6 provides the
descriptive statistics for the number of items presented.

Table 5
Analysis of Variance for Number of Items Presented

Source                                      df    Mean Square        F

Between Subjects                           449
   Testing version                           8     31285.58      256.06*
   Subjects-within-groups                  441         2.56
   Estimates for quasi-F calculations       72       122.18
Within Subjects                           4050
   Examination                               9       276.51      142.53*
   Testing version x examination            72       121.56       62.66*
   Examination x subjects-within-groups   3969         1.94

*p < .01.


The results of the Newman-Keuls tests for the testing version effect showed
that each adaptive test required significantly fewer (α = .05) items than the
control test. There were no significant differences among the adaptive
versions.

Although  significant  differences  existed  in  numbers  of  items  presented  for 
the  10  examinations,  the  adaptive  testing  versions  varied  only  slightly  in  their 
relative  efficiency.  A  version  that  appeared  to  require  the  fewest  items  on  one 
examination may have required the most on another examination. The numbers of
items required by the adaptive versions for any one test were not substantially
different.

Loss  as  a  Function  of  Achievement  Levels 

Although  Adaptive  Testing  Version  6  demonstrated  overall  superior  accuracy, 
the  losses  incurred  for  all  examinees  were  not  the  same.  More  importantly,  the 
losses  relative  to  examinees'  general  achievement  may  be  small  for  some  levels 
but high for others. The mean losses as a function of examinees' achievement

The adaptive testing versions had smaller losses for the middle and upper
achievement levels, but this was reversed for the lower levels. This difference
could be eliminated by reducing the α level. It may be recalled that β was set
to .1, whereas α was set at .2. Since the false nonmastery error would be
larger than the false mastery error, a higher proportion of false
classifications would be expected for those at the lower achievement levels.

The  adaptive  testing  versions  may  have  produced  more  inaccurate  classifica¬ 
tions  due  to  the  paucity  of  data  representative  of  poorer-achieving  students. 
Since  only  a  small  proportion  of  examinees  in  the  data  base  did  not  master  the 
objectives,  the  predictions  made  for  the  poorer-achieving  students  were  often 
based  on  relatively  few  data  cases.  Such  was  not  the  case  for  those  with  higher 
achievement  levels. 

Selection  of  Adaptive  Testing  Versions  for  the  Next  Phase 


The intention of the next phase of the study was to compare the results of
some of the adaptive testing versions with those obtained in the present testing
system used in the Air Force Weapons Mechanics training program at Lowry Air
Force Base. Adaptive Testing Versions 4 and 6 were selected. No version was
significantly superior in numbers of items presented. Adaptive Testing Version
6 was selected because of its superior accuracy. Adaptive Testing Version 4 was
selected, however, solely on the basis of the mean number of items presented for
item prediction.


Phase II: Real Data Simulations


Purpose


The purpose of this phase of the study was to compare (1) the relative
efficiency of Adaptive Testing Versions 4 and 6 with each other and with the
present testing method used in the Weapons Mechanics training program and (2)
the classification decisions made from the adaptive testing versions with those
made by the present method used in the Weapons Mechanics training program.

Design 


The control testing version for this phase was a testing procedure consisting
of a fixed set of items for each objective. Hence, all examinees answered
the  same  set  of  items  under  the  control  treatment. 

Classification decisions made by the adaptive testing and control testing
versions were compared using an index defined as the number of agreements minus
the number of disagreements. An agreement in classifying an examinee's
performance on an objective was obtained when both versions indicated "mastery"
or both indicated "nonmastery." Since for the adaptive tests, performance
classified as "indeterminate" dictated procedures identical to those classified
as "nonmastery," this condition was also considered an agreement. The $\alpha$
and $\beta$ values selected were the same as in the previous phase: .2 and .1,
respectively.
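
As a minimal sketch of the agreement index, the following Python function counts agreements minus disagreements over the objectives of one examinee. The function name, the string labels, and the sample decisions are illustrative assumptions and are not taken from the study; the sketch also assumes that matching "mastery" decisions count as agreements.

```python
def agreement_index(adaptive, control):
    """Agreements minus disagreements across objectives for one examinee."""
    if len(adaptive) != len(control):
        raise ValueError("both versions must classify the same objectives")
    index = 0
    for a, c in zip(adaptive, control):
        # "indeterminate" dictated the same procedures as "nonmastery",
        # so it is folded into that category before the comparison.
        if a == "indeterminate":
            a = "nonmastery"
        index += 1 if a == c else -1
    return index


# Hypothetical decisions on four objectives.
adaptive = ["mastery", "indeterminate", "mastery", "nonmastery"]
control = ["mastery", "nonmastery", "nonmastery", "nonmastery"]
print(agreement_index(adaptive, control))  # 3 agreements - 1 disagreement = 2
```

With k objectives the index ranges from -k (complete disagreement) to k (complete agreement).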

Data that were actually collected on four examinations in the Weapons
Mechanics training program were used in the computer simulations for this phase.
For each examination, from 250 to 290 response sets were available. It was not
feasible to match student identification codes across the examinations, since
there was no control over the forms of the tests taken by the examinees. For
each examination, the first 150 response sets, sorted in ascending chronological
order, were used to form the data base. Of the remaining subjects, 50 were
randomly selected as the examinees who were to take the simulated adaptive tests.
Hence, within each examination the same 50 trainees were used as examinees,
regardless of the testing version; but the same 50 trainees were not used across
examinations.
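
The split described here can be sketched in a few lines of Python. The record layout (a dictionary with a "date" key), the function name, and the fixed seed are assumptions made only for illustration; the study does not describe its software.

```python
import random


def split_response_sets(response_sets, n_base=150, n_examinees=50, seed=0):
    """Split response sets into a data base and a sample of examinees."""
    # Sort in ascending chronological order; the earliest sets form the data base.
    ordered = sorted(response_sets, key=lambda r: r["date"])
    data_base = ordered[:n_base]
    # Draw the simulated examinees at random from the remaining sets.
    rng = random.Random(seed)
    examinees = rng.sample(ordered[n_base:], n_examinees)
    return data_base, examinees


# Hypothetical records: 260 response sets with an ordinal date stamp.
records = [{"id": i, "date": i} for i in range(260)]
data_base, examinees = split_response_sets(records)
print(len(data_base), len(examinees))
```

Because the examinees are drawn only from sets outside the data base, no simulated examinee is matched against his or her own responses.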

The assumed hierarchical configurations for the objectives for each
examination were provided by Roger Pennell of the Air Force Human Resources
Laboratory, Lowry Air Force Base, Denver, Colorado. The mastery score for an
objective with N (> 2) items was set to N - 1, as is presently done with the
conventional testing procedure, which is referred to here as the control testing
version. If N equaled 1, the cutting score was set to 1.

Correlated  t  tests  were  used  to  compare  adaptive  testing  versions.  A  t 
test  for  a  mean  equal  to  a  constant  was  employed  for  each  comparison  of  each 
adaptive  testing  version  to  the  control  testing  version. 

Results 


Both adaptive testing versions used in this phase of the study required
significantly fewer items than the control testing version.
Version 4 of the model required the presentation of fewer items than Version 6.

Efficiency. Adaptive Testing Version 4 required statistically significantly
(t = 8.30, df = 199, p < .001) fewer items than Version 6. The descriptive
statistics for these versions are shown in Table 7. Although there was a
statistical difference, the superior efficiency of Version 4 amounted to less than
one item per examinee per examination.

Table 7

Descriptive Statistics for Adaptive
Testing Versions 4 and 6

                              Adaptive Testing Version
Variable and Statistic             4           6

Number of Items Presented
  Mean                            3.02        3.92
  SD                              1.19        1.42
Index of Agreement
  Mean                            6.15        5.54
  SD                              3.39        3.27

Mastery/nonmastery decisions. Adaptive Testing Version 4 had a statistically
significantly (t = 5.58, df = 199, p < .001) higher agreement in
mastery/nonmastery classifications than Version 6. The descriptive statistics for
these versions are also shown in Table 7.

The average number of objectives per examination was 7.25. Hence, the
range of the index could be from -7.25 to 7.25. A complete agreement in
decisions would result in an index value of 7.25; a complete disagreement would
result in a value of -7.25. In terms of percent of agreements in decisions,
Versions 4 and 6 had 92% and 88% agreement with the control testing version,
respectively.
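
The percentages follow directly from the index. With k objectives and A agreements, the index equals A - (k - A), so A = (k + index) / 2. The short Python check below applies this to the mean index values reported in Table 7 (6.15 and 5.54) and the average of 7.25 objectives; the function name is an assumption made for illustration.

```python
def percent_agreement(mean_index, n_objectives):
    """Convert an agreements-minus-disagreements index to percent agreement."""
    # index = A - (k - A)  implies  A = (k + index) / 2
    agreements = (n_objectives + mean_index) / 2
    return 100 * agreements / n_objectives


for version, mean_index in (("4", 6.15), ("6", 5.54)):
    print(version, round(percent_agreement(mean_index, 7.25)))
```

This reproduces the reported figures of 92% for Version 4 and 88% for Version 6.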

Separate t tests were performed on the number of items presented for each
of the adaptive testing versions compared with the number required by the
control version. The mean number of items presented under the control testing
version across the four tests was 15.25. The numbers of items required by the
adaptive testing versions are presented in Table 8. The visual comparison of the
tabled values reveals such large differences that no statistical test was
necessary.


Since  the  four  examinations  differed  in  hierarchical  configurations,  number 
of  objectives,  and  number  of  available  items,  Table  8  presents  the  percent  of 
reduction  in  test  items  required  by  the  adaptive  testing  versions  in  relation  to 
the control testing version for each examination. The table also shows the percent
of agreements in mastery/nonmastery decisions between each adaptive testing
version  and  the  control  testing  version. 

Table 8

Comparison of Results of Adaptive Testing Versions 4 and 6
to Control Testing Version for Each Examination

Adaptive Testing   Number of Items Presented     Percent     Number       Percent of Mastery
Version and        Control    Adaptive           of Item     of           and Nonmastery
Examination        Version    Testing Version    Reduction   Objectives   Agreements

Version 4
  1                  20          4.3                79          14            91
  2                  12          2.6                78           4            98
  3                  14          2.5                82           6            86
  4                  15          3.2                79           5            99

Version 6
  1                  20          5.1                75          14            87
  2                  12          4.2                65           4            92
  3                  14          2.4                83           6            84
  4                  15          4.0                73           5            93
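
The percent-of-item-reduction column can be recomputed from the item counts in Table 8 as 100 x (control - adaptive) / control. The Python sketch below does this for both versions; the variable and function names are assumptions made for illustration, and the unrounded values may differ from the tabled ones in the last digit.

```python
# (examination, control items, adaptive items) as listed in Table 8
version_4 = [(1, 20, 4.3), (2, 12, 2.6), (3, 14, 2.5), (4, 15, 3.2)]
version_6 = [(1, 20, 5.1), (2, 12, 4.2), (3, 14, 2.4), (4, 15, 4.0)]


def percent_reduction(control_items, adaptive_items):
    """Percent of control-test items saved by the adaptive version."""
    return 100 * (control_items - adaptive_items) / control_items


def mean_reduction(rows):
    """Unweighted mean percent reduction across examinations."""
    return sum(percent_reduction(c, a) for _, c, a in rows) / len(rows)


for name, rows in (("Version 4", version_4), ("Version 6", version_6)):
    print(name, [round(percent_reduction(c, a), 1) for _, c, a in rows],
          round(mean_reduction(rows), 1))
```

The control item counts also average to the 15.25 items reported in the text.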

The results show that both Adaptive Testing Versions 4 and 6 made most of
the same mastery/nonmastery decisions as were presently being made by the Air
Force in its Weapons Mechanics program, but the adaptive testing versions made
the decisions with approximately 75% fewer items than the conventional, or
control, version.

Discussion and Conclusions

Both  simulation  phases  of  the  study  have  shown  that  the  adaptive  testing 
versions  could  make  mastery/nonmastery  decisions  much  more  efficiently  than 
testing  on  each  objective  with  a  constant  number  of  items  for  each  objective 
presented. 

The real-data simulation showed that the mastery/nonmastery agreement
between the control testing version and the adaptive testing versions was higher
for Adaptive Testing Version 4. This does not mean that Version 4 is more
accurate than Version 6. On the contrary, in the first simulation it was
demonstrated that Adaptive Testing Version 6 was the only adaptive procedure that had
significantly smaller loss than the control version. In essence, Adaptive
Testing Version 4 and the control version in the second simulation phase would be
expected to be equally inaccurate in mastery/nonmastery decisions. Adaptive
Testing Version 6 would be expected to be more accurate than the control version
and, hence, would have fewer agreements with the control version than would
Version 4.

Although in both phases statistically significant differences were found
among the adaptive testing versions, the assignment of different values to the
versions' parameters might equalize all results. All the adaptive testing
versions were used with the same values specified for the model's parameters. For
example, for all versions, $\alpha$ and $\beta$ were set at .2 and .1, respectively. The
versions may be differentially sensitive to the parameters. Hence, two versions
may be expected to perform exactly the same, but only by specifying different
values for the same parameters.

For both simulation phases of the study the number of sets of responses
needed in the data bases was unknown. For the second simulation phase it was
estimated that 150 sets would be sufficient. The results indicate that an
average of 29 sets matched each examinee's set on each test. The average of 29 sets
per examinee did not give sufficient information as to whether the data base was
of sufficient size. The ranges in number of sets indicated that for every test
and for every adaptive testing procedure the data base was completely depleted
for some examinees. As in the first simulation, it may not be that the data
base contained insufficient numbers of response patterns but that there was an
insufficient number of patterns for poorer performing individuals. In both
phases the data bases were composed of response patterns representative in type
and proportion of those patterns expected in the population of examinees. It
appears that when a high proportion of examinees mastered the objectives, as in
the Weapons Mechanics program, such a data base is insufficient for predictions
of performance by nonmastering examinees. Hence, in such a situation,
oversampling of nonmastering examinees may be required in order to provide adequate
data for all levels of performance.

Because of the similarity of the results for all the adaptive testing
versions in the monte carlo simulations and the superior efficiency demonstrated by
Adaptive Testing Versions 4 and 6 in the real-data simulations, it
appears that any of the adaptive testing variations used in this study would be
much more efficient than the conventional testing procedure used by the Air
Force.


A Comparison of ICC-Based Adaptive Mastery Testing and the
Waldian Probability Ratio Method


G.  Gage  Kingsbury  and  David  J.  Weiss 
University  of  Minnesota 


The use of criterion-referenced achievement test interpretation has gained
great support within the educational measurement community since its
introduction less than two decades ago (Glaser & Klaus, 1962). It is intuitively
appealing to educators to be able to measure students' performances against an
absolute  standard  of  behavior  on  prespecified  learning  objectives,  and  the  use 
of  criterion-referenced  test  interpretation  gives  educators  this  capability. 

One of the most basic forms of criterion-referenced test interpretation involves
classifying students into two categories: one containing students who have
achieved a sufficient command of the subject matter (mastery) and the other
containing students who have not achieved a sufficient command of the subject
matter (nonmastery). Traditionally, a student is declared a master if his/her
score on a conventional classroom achievement test is as high as or higher than a
prespecified cutoff point or is declared a nonmaster if his/her score on the
test is lower than the cutoff point. This form of classroom testing has been
called mastery testing and can be useful (1) in determining the degree of
student proficiency within a classroom and (2) as a diagnostic tool to identify
individuals who need further training in specific instructional areas (Nitko &
Hsu, 1974).

As  traditional  mastery  testing  has  been  developing  its  own  technology, 
adaptive testing technology has also developed to allow educators to make
maximum use of classroom testing time while reducing the amount of time spent on
testing  to  a  minimum.  The  use  of  adaptive  testing  techniques  has  recently  been 
shown  to  be  effective  in  reducing  test  length  while  obtaining  high-fidelity 
achievement  level  estimates  in  several  instructional  settings  (e.g.,  Bejar, 
Weiss,  &  Gialluca,  1977;  Brown  &  Weiss,  1977). 

Mastery  and  adaptive  testing  technologies  have  each  shown  their  usefulness 
in  the  academic  setting  for  different,  but  compatible,  reasons.  It  is  therefore 
not surprising that a fusion of the two techniques should occur in order to
allow mastery testing to be accomplished in the shortest possible class time while
maintaining  the  accurate  decisions  necessary  for  correct  diagnoses  of  student 
instructional  problems. 

Approaches  to  Adaptive  Mastery  Testing 


Two attempts that have been made to combine mastery and adaptive testing
technologies are Ferguson's (1969, 1970) application of Wald's Sequential
Probability Ratio Test (SPRT) to mastery testing and Kingsbury and Weiss's
(1979a) formulation of an item characteristic curve (ICC) approach to adaptive
mastery testing (AMT). Both of these testing procedures attempt to accomplish
two common ends. First, the procedures seek to shorten the length of the test.
Second, the procedures use statistical techniques designed to hold the number of
misclassifications (i.e., individuals for whom the wrong decision is made) to
some acceptable minimum. The methods by which these two procedures attempt to
accomplish these ends are quite different.

The  very  fact  that  two  procedures  exist  that  attempt  to  accomplish  the  same 
basic  ends  through  different  techniques  renders  a  comparison  of  the  two  methods 
desirable.  The  prime  objective  of  this  paper,  then,  was  a  comparison  of  the 
efficiency  with  which  these  two  procedures  for  mastery  testing  achieved  their 
goals of reducing test length while obtaining a high percentage of correct
decisions. The first level of comparison presented here is a descriptive comparison
based on the theories underlying each of the procedures. This is followed by an
empirical comparison of the two testing procedures within the context of a monte
carlo simulation of test responses designed to fit a number of theoretical
contingencies.

Wald's  SPRT  Applied  to  Mastery  Testing 

The SPRT procedure. Wald's (1947) SPRT was originally designed as a quality
control test for use in a manufacturing setting. It was designed to determine
whether a large consignment of products (e.g., light bulbs) contained a
small enough proportion of defective bulbs to pass some prespecified quality
criterion while only testing a small sample of the light bulbs in the
consignment. Wald's solution to this problem was to draw light bulbs sequentially from
the consignment, to test the light bulb drawn at each stage, and to determine at
each stage the relative probabilities of the following two hypotheses:


$H_0$: $p = p_0$   [1]

$H_1$: $p = p_1$   [2]

where

$p$ = the proportion of defective elements (light bulbs) in the population
(consignment);

$p_0$ = the proportion of defective elements in the population below which it
is always desired to accept the quality of the population; and

$p_1$ = the proportion of defective elements in the population above which it
is always desired to reject the quality of the population.

Since each stage of the sampling procedure may be viewed as a Bernoulli
trial (given that each element is sampled at random without replacement from the
population of equivalent elements and assigned either nondefective or defective
status), the probability of observing a certain number of defective elements in
a sample of a certain size, given that either $H_0$ or $H_1$ is true, may be
described with the binomial probability function. Consequently, the probability
of observing $W_m$ defective elements in a sample of $m$ elements, under
$H_0$: $p = p_0$, is

$$P_{0m} = p_0^{W_m} (1 - p_0)^{m - W_m}$$   [3]

Under $H_1$: $p = p_1$, the probability becomes

$$P_{1m} = p_1^{W_m} (1 - p_1)^{m - W_m}$$   [4]

The ratio of these two probabilities yields an index of the relative
strengths of the two hypotheses such that at each stage in the sampling
procedure the quality of the consignment may be either rejected or accepted, or
sampling of elements may be continued. The stringency of the test is based (1) on
the proportion ($\alpha$) of errors that can be tolerated in rejecting the quality of
consignments that actually do have the quality desired and (2) on the
proportion ($\beta$) of errors that can be tolerated in accepting the quality of
consignments that do not actually have the minimum acceptable quality.

In its final log form the test used by the SPRT at each stage of sampling
specifies that if

sampling continues.


Wald (1947) has shown that this testing procedure results in error levels
approximating $\alpha$ and $\beta$ across consignments. Further, it has been shown that the
probability of not obtaining a decision for a consignment approaches zero as the
sample size increases.


Ferguson’s  application  to  mastery  testing.  Ferguson  (1969)  has  applied  the 
SPRT  within  a  mastery  testing  situation  using  test  item  responses  in  place  of 
light  bulbs  and  a  domain  of  items  that  represents  an  instructional  objective 
instead  of  a  consignment.  The  quality  that  Ferguson  evaluated  was  students' 
command  of  the  content  area  being  tested.  Ferguson  also  branched  through  an 
instructional  hierarchy,  applying  the  SPRT  to  various  objectives  of  instruction. 
The  present  study,  however,  will  concentrate  on  the  application  of  SPRT  to  a 
single instructional unit.

To employ the SPRT in a mastery testing situation, the educator must
specify the following:

1. Two criteria of performance ($p_0$ and $p_1$), which serve as the lowest
level at which a mastery decision will be made and the highest level at
which a nonmastery decision will be made and which bound the
uncertainty region in which testing will continue.

2. Two levels of error acceptance ($\alpha$ and $\beta$), which determine the
strictness of the decision test and should reflect the relative costs of the
two error types.

3.  A  maximum  test  length  to  constrain  the  testing  time  for  individuals  who 
are  very  difficult  to  classify. 

One characteristic of this form of adaptive mastery testing is that it is
fairly simple to implement within a classroom situation. The decision rule is
easily incorporated into a chart that shows the teacher or the student how many
questions need to be answered correctly or incorrectly for each test length in
order to terminate the test. Once the charts are made for various values of
$p_0$, $p_1$, $\alpha$, and $\beta$, the statistical work is completed. This puts
the power of the SPRT procedure into the hands of the educator quite readily.
The procedure is not fully adaptive, however. Items are selected at random or
in a fixed sequence; it is only the test length that varies for individuals.
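
A minimal Python sketch of such a decision rule and chart is given below. It uses the standard textbook form of Wald's test on the log likelihood ratio, with thresholds log[(1 - β)/α] and log[β/(1 - α)]; this form, the function names, and the parameter values are assumptions made for illustration and are not taken from this paper. Following the consignment example, the counted event is a "defective" element, so in a mastery application it would correspond to an incorrect response.

```python
import math


def sprt_decision(n_defective, n_sampled, p0, p1, alpha, beta):
    """Return 'accept', 'reject', or 'continue' (standard Wald SPRT form)."""
    # Log likelihood ratio of H1 (p = p1) to H0 (p = p0) for Bernoulli trials.
    llr = (n_defective * math.log(p1 / p0)
           + (n_sampled - n_defective) * math.log((1 - p1) / (1 - p0)))
    if llr >= math.log((1 - beta) / alpha):
        return "reject"      # evidence favors H1: quality rejected
    if llr <= math.log(beta / (1 - alpha)):
        return "accept"      # evidence favors H0: quality accepted
    return "continue"        # no decision yet; sample another element


def decision_chart(max_length, p0, p1, alpha, beta):
    """For each length m, the largest count that accepts and the smallest that rejects."""
    chart = []
    for m in range(1, max_length + 1):
        outcomes = [sprt_decision(w, m, p0, p1, alpha, beta) for w in range(m + 1)]
        accept_at = max((w for w, d in enumerate(outcomes) if d == "accept"), default=None)
        reject_at = min((w for w, d in enumerate(outcomes) if d == "reject"), default=None)
        chart.append((m, accept_at, reject_at))
    return chart


# Hypothetical settings: p0 = .2, p1 = .4, alpha = beta = .1, at most 10 elements.
chart = decision_chart(10, 0.2, 0.4, 0.1, 0.1)
for row in chart:
    print(row)
```

Each printed row is one line of the chart; `None` marks a test length at which that decision cannot yet be reached, and reaching the maximum length without a decision would require a separate forced-classification rule.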

ICC-Based  Adaptive  Mastery  Testing  (AMT) 

The  paradigm  for  AMT  that  Kingsbury  and  Weiss  (1979)  have  proposed  makes 
use  of  ICC  theory  and  Bayesian  statistical  theory  to  adapt  the  mastery  test  to 
the  individual's  level  of  skill  during  the  testing  process.  ICC  theory  is  used 
to  estimate  the  parameters  that  most  efficiently  describe  each  of  the  items  in 
the  item  pool.  Given  these  parameter  estimates,  it  is  possible  to  prescribe  a 
type of adaptive procedure that may allow mastery decisions that are quite
accurate to be made while shortening the length of the test needed for most
individuals.


The  AMT  procedure  is  based  on  three  integrated  procedures.  These  are  (1)  a 
procedure  for  individualizing  the  administration  of  test  items,  (2)  a  method  for 
converting  a  traditional  (proportion  correct)  mastery  level  to  the  latent 
achievement  metric,  and  (3)  a  procedure  for  making  mastery  decisions  using 
Bayesian  confidence  intervals. 

Individualized item selection. To make mastery testing a more efficient
process, it is desirable to reduce the length of each individual's test (1) by
eliminating test items that provide little information concerning an
individual's achievement level and (2) by terminating the AMT procedure after enough
information has been gathered so that the mastery decision can be made with a
high degree of confidence. To operationalize this goal, an item to be
administered to an individual at any point during the testing procedure is selected on
the basis of the amount of information that the item provides concerning the
individual's achievement level estimate at that point in the test, since that
item should provide the most efficient use of testing time. A procedure that
selects and administers the most informative item at each point in an adaptive
test, the maximum information search and selection (MISS) technique, has been
described by Brown and Weiss (1977) and is part of the AMT procedure.

The information that an item provides at each point along the achievement
continuum may be determined using the ICC model that is assumed to underlie
individuals' responses to test items. The AMT procedure assumes the 3-parameter
logistic ICC model (Birnbaum, 1968). Using this model, the information
available in any item is (Birnbaum, 1968, Equation 20.4.16)

$$I_i(\theta) = \frac{(1 - c_i)\, D^2 a_i^2\, \psi^2[D L_i(\theta)]}{\psi[D L_i(\theta)] + c_i\, \Psi^2[-D L_i(\theta)]}$$   [8]

where

$I_i(\theta)$ = the information available from item $i$ at any achievement level,
$\theta$;

$c_i$ = the lower asymptote of the ICC for the item;

$D$ = 1.7, a scaling factor used to allow the logistic ICC to closely
approximate a normal ogive;

$a_i$ = the discriminatory power of the item at the inflection point of the
ICC;

$\psi$ = the logistic probability density function;

$L_i(\theta)$ = $a_i(\theta - b_i)$, where $b_i$ is the difficulty of the item; and

$\Psi$ = the cumulative logistic function.

If it is assumed that the achievement level estimate ($\hat{\theta}$) is the best
estimate of the actual achievement level ($\theta$), the item information of each of the
items not yet administered may be evaluated at $\hat{\theta}$ at any point during the test.
The item that has the highest information value at the individual's current
level of $\hat{\theta}$ is thus chosen to be administered next.
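
The selection step can be sketched in Python as follows. The information function is the 3-parameter logistic item information written in terms of the logistic density and cumulative function, which is algebraically the squared slope of the ICC divided by P(1 - P). The representation of an item as an (a, b, c) tuple, the function names, and the three-item pool are assumptions made for illustration.

```python
import math

D = 1.7  # scaling factor bringing the logistic ICC close to the normal ogive


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def logistic_pdf(x):
    p = logistic_cdf(x)
    return p * (1.0 - p)


def item_information(theta, a, b, c):
    """3-parameter logistic item information at achievement level theta."""
    x = D * a * (theta - b)
    psi = logistic_pdf(x)
    return ((1 - c) * D ** 2 * a ** 2 * psi ** 2) / (psi + c * logistic_cdf(-x) ** 2)


def select_next_item(theta_hat, items, administered):
    """Index of the unadministered item with maximum information at theta_hat."""
    candidates = [i for i in range(len(items)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *items[i]))


# Hypothetical pool of (a, b, c) parameter estimates.
pool = [(1.0, -2.0, 0.2), (1.0, 0.0, 0.2), (1.0, 2.0, 0.2)]
print(select_next_item(0.1, pool, administered=set()))
```

In a full adaptive test, the chosen index would be added to `administered` and the estimate `theta_hat` updated after each response before the next selection.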

For this study a Bayesian estimator of the individual's achievement level,
developed by Owen (1969), was used. This estimation procedure has been shown to
yield biased estimates of trait levels (Kingsbury & Weiss, 1979; McBride &
Weiss, 1976). This bias may be attributed to the assumption of a normal
distribution of $\theta$ in the population made by Owen's procedure or to inappropriate
prior information concerning $\theta$ on the individual level (Kingsbury & Weiss,
1979b). The bias inherent in this scoring strategy may render the MISS
technique less efficient than it would be under optimal conditions, thereby reducing
the efficiency of the AMT technique as a whole.

To use MISS under optimal conditions, trait level estimates should be
obtained by maximum likelihood estimation, which yields asymptotically efficient
estimates (Birnbaum, 1968). Maximum likelihood estimation techniques are not
able, however, to obtain trait level estimates for consistent item response
patterns (either all correct or all incorrect) or for item response patterns for
which the likelihood function is extremely flat. The Bayesian technique will
yield an estimate for any response vector. The inability of maximum likelihood
estimation to estimate $\theta$ for some response patterns militated against its
use for AMT. Consequently, the Bayesian estimation procedure was
used in the AMT procedure on the assumption that the capability to obtain a $\theta$
estimate for each individual at each point during the test would outweigh any
efficiency lost due to the bias inherent in the estimation procedure. The use
of the Bayesian estimation strategy in this study also allowed the use of easily
interpretable Bayesian confidence intervals to make the mastery decision.

Mastery level. The classical mastery testing procedure specifies a
percentage of the items on a test that must be correctly answered by an individual
in order for him/her to be declared a master. Using ICC theory, it is possible
to generate an analog to the percentage cutoff of classical theory for use in
adaptive testing, even though the use of MISS will tend to result in each person
answering about 50% of the items correctly, given a large enough item pool
(because items administered will most probably be close to the individual's level
of $\theta$). The analog is based on the use of the test characteristic curve (TCC;
Lord & Novick, 1968). The TCC is the function that relates the achievement
continuum to the expected proportion of correct answers that a person at any level
of $\theta$ may be expected to obtain if all of the items on the test are administered.


For this procedure the assumption was made that a 3-parameter logistic
ogive described the functional relationship between the latent trait
(achievement) and the probability of observing a correct response to any of the items on
the test. This assumption yields a TCC of the following form:

$$E(P \mid \theta) = \frac{1}{n} \sum_{i=1}^{n} \left[ c_i + (1 - c_i)\, \frac{1}{1 + \exp[1.7 a_i (b_i - \theta)]} \right]$$   [9]

where

$E(P \mid \theta)$ = the expected value of the proportion of correct answers observed
on the test given at any achievement level;

$n$ = the number of items on the test;

$c_i$ = the estimate of the lower asymptote for the ICC of item $i$;

$a_i$ = the estimate of the discriminatory power for the item;

$b_i$ = the estimate of the difficulty of the item; and

$\theta$ = any given achievement level.


This monotonically increasing function makes it possible to convert any given
level of $\theta$ to its most likely proportion correct or, more importantly in this
context, to determine the level of $\theta$ that will most probably result in any given
proportion of correct answers. To exemplify the use of the TCC in determining a
level of $\theta$ that is comparable to a desired percentage mastery level, a
hypothetical TCC is shown in Figure 1. Assuming that some items from the test
represented by this TCC are to be administered in some adaptive manner (e.g., MISS)
and that a level of $\theta$ is to be determined that corresponds to, say, 70% correct
performance on the entire test, it may be done using the following steps:

1. Draw a horizontal line (Line A in Figure 1) from the .7 mark on the
vertical (expected proportion correct, or P) axis of the TCC figure to
the TCC.

2. Drop a vertical line (Line B) from the point of intersection of the TCC
and the horizontal line drawn in Step 1 to the horizontal (achievement
level, or $\theta$) axis. This point ($\theta_m$) on the achievement level axis is
designated the mastery level in terms of the achievement ($\theta$) metric.

3. The mastery level specified in Step 2 above may now be used to make
mastery decisions in place of the .7 mastery level originally specified
using any subset of items from the original test, provided that
individuals' item responses are scored with a method that will put the $\theta$
estimate on the same metric as the TCC. Any ICC-based scoring
procedure (e.g., Bejar & Weiss, 1979) will result in a $\theta$ estimate that will
be on the correct metric. This procedure allows the transformation of
any desired proportion correct mastery level to the $\theta$ metric. Once
this transformation is made, ICC theory and its technology may be used
to increase the efficiency of present mastery testing techniques.
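
The graphical steps above amount to inverting the TCC numerically. The Python sketch below computes the TCC as the mean of the 3-parameter logistic response probabilities and finds the mastery level by bisection, which works because the TCC is monotonically increasing. The item parameters, the search interval, and the function names are assumptions made for illustration.

```python
import math


def tcc(theta, items):
    """Expected proportion correct at theta for items given as (a, b, c)."""
    total = 0.0
    for a, b, c in items:
        total += c + (1 - c) / (1 + math.exp(1.7 * a * (b - theta)))
    return total / len(items)


def mastery_theta(proportion, items, lo=-6.0, hi=6.0, tol=1e-8):
    """Achievement level at which the TCC equals the given proportion correct."""
    if not tcc(lo, items) < proportion < tcc(hi, items):
        raise ValueError("proportion is not reachable on the search interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tcc(mid, items) < proportion:
            lo = mid   # the TCC is still below the target: move right
        else:
            hi = mid   # the TCC is at or above the target: move left
    return (lo + hi) / 2


# Hypothetical three-item test; 70% correct is the desired mastery level.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
theta_m = mastery_theta(0.70, items)
print(round(theta_m, 3), round(tcc(theta_m, items), 3))
```

A proportion at or below the average lower asymptote cannot be reached at any achievement level, which is why the function checks the target against the ends of the search interval first.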

Figure 1

Hypothetical Test Characteristic Curve Illustrating Conversion
from the Proportion Correct Metric to the Achievement Metric

[Figure not reproduced; horizontal axis: Achievement ($\theta$) Level]


Making the mastery decision using Bayesian confidence intervals. Although
any achievement level estimate of any subset of the items from a test obtained
using ICC-based scoring will be on the same metric as the TCC for the original
test, two different subsets of items may result in $\theta$ estimates that are not
equally informative. For example, if one test consisted of many items and the
other used only a few items, the longer test would probably yield a more precise
$\theta$ estimate, provided that the items in the two tests had similar ICCs. Thus,
ICC-based $\theta$ estimates that are on the same metric are comparable except for
their differential precision. Comparisons of ICC-based $\theta$ estimates should
therefore be based on confidence interval estimates instead of the raw
achievement level point estimates.

For this reason, the AMT strategy makes mastery decisions with the use of
Bayesian confidence intervals. Specifically, after each item is selected and
administered to an individual (for this application MISS is used to choose the
appropriate item at each point in the test), a point estimate of the
individual's achievement level (θ) may be determined using Owen's Bayesian
scoring algorithm, using information gained from all items administered
previously. Given this point estimate and the corresponding variance estimate,
also obtained using Owen's procedure, a Bayesian confidence interval may be
defined such that

\[
\hat{\theta}_i - 1.96\,(\sigma_i^2)^{1/2} < \theta < \hat{\theta}_i + 1.96\,(\sigma_i^2)^{1/2}, \quad \text{with } p = .95 \qquad [10]
\]

where

$\hat{\theta}_i$ = the Bayesian point estimate of achievement level, calculated
following item i;

$\sigma_i^2$ = the Bayesian posterior variance following item i; and

$\theta$ = the true achievement level.

Equation 10 may be interpreted as meaning that the probability is .95 that the
true value of the achievement level parameter, θ, is within the bounds of the
confidence interval. It might also be said that there is 95% confidence that
the true parameter value lies within the confidence interval.

After this confidence interval has been generated, it is a simple matter to
determine whether or not θ_m, the achievement level earlier designated as the
mastery level on the achievement metric, falls outside the limits of the
confidence interval. If it does not, the testing procedure administers another
item to the individual and recalculates the confidence interval. This procedure
continues until, after some item has been administered, the confidence interval
calculated does not include θ_m, the mastery level on the achievement continuum.
At this time the testing procedure terminates and a mastery decision is made.

If the lower limit of the confidence interval falls above the specified mastery
level, θ_m, the individual is declared a master; if the upper limit of the
confidence interval falls below θ_m, the individual is declared a nonmaster.
Given a finite item pool size, however, the testing procedure may exhaust the
pool before a decision can be made in this manner. It is possible to make a
decision concerning mastery for any of these individuals based on whether the
Bayesian point estimate of their achievement level (θ) is above or below the
specified mastery level, θ_m. These decisions, however, cannot be made with the
same degree of confidence as those made with confidence intervals that do not
contain the mastery level.
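
The decision rule just described can be sketched as a small function. This is an illustration only: the function name, its return convention, and the treatment of a point estimate exactly equal to the mastery level (counted as mastery) are assumptions, not details given in the text.

```python
import math


def mastery_decision(theta_hat, post_var, theta_m, z=1.96, pool_exhausted=False):
    """Apply the interval rule to a posterior mean and variance.

    Returns (decision, by_interval). decision is "master", "nonmaster", or
    "continue"; by_interval is True only when the interval excluded theta_m.
    """
    half_width = z * math.sqrt(post_var)
    lower = theta_hat - half_width
    upper = theta_hat + half_width
    if lower > theta_m:
        return "master", True
    if upper < theta_m:
        return "nonmaster", True
    if pool_exhausted:
        # Fallback on the point estimate alone; made with less confidence.
        # A tie is treated as mastery here, which is an assumption.
        return ("master" if theta_hat >= theta_m else "nonmaster"), False
    return "continue", False
```

For example, with a posterior mean of 0.9 and a posterior variance of 0.16, the interval is 0.9 ± 1.96(0.4), whose lower limit (0.116) lies above a mastery level of 0.00, so testing would stop with a mastery decision.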

Wald's SPRT versus ICC-Based AMT Procedure


The  two  mastery  testing  strategies  described  above  differ  in  a  number  of 
characteristics.  The  most  salient  of  these  differences  are  as  follows: 

1.  Treatment  of  the  items  in  the  domain. 

2.  Treatment  of  the  uncertainty  of  decisions. 

3.  Treatment  of  the  mastery  cutoff. 

4.  Treatment  of  the  achievement  metric. 

Treatment of items. The SPRT, in the simple form outlined above, treats all
of  the  items  in  the  mastery  test  as  if  they  were  perfect  replicates  of  each 
other.  Thus,  an  individual's  response  to  a  particular  item  is  viewed  solely  as 
a  probabilistic  function  of  the  individual's  true  mastery  status.  This  assump¬ 
tion  is  most  appropriate  in  the  production  setting  in  which  Wald  originally  de¬ 
signed  his  procedure;  each  light  bulb  can  be  expected  to  be  like  every  other 
light  bulb.  This  assumption  may  be  less  tenable  in  the  mastery  testing  situa¬ 
tion,  where  an  individual's  responses  to  test  items  may  vary  as  a  function  of 
differential  characteristics  of  the  items  themselves,  as  well  as  his/her  mastery 
status. 

The AMT procedure assumes that if items differ, their individual
characteristics may be described by a logistic ogive that varies as a function
of the item's power to discriminate among individuals with different
achievement levels (a), the item's difficulty (b), and the ease with which an
individual may answer the item correctly with no knowledge of the subject
matter (c). This assumption concerning the operating characteristics of the
items is less restrictive than the assumption made in the SPRT procedure
described above; but to the extent that the items do not conform to the
logistic form specified, the assumption might still restrict the efficiency of
the AMT procedure.

Both  mastery  testing  procedures,  therefore,  postulate  some  systematic  simi¬ 
larities  among  the  test  items.  To  the  extent  that  one  of  the  postulations  is 
closer  to  the  actual  state  of  the  world  than  the  other,  it  might  be  expected 
that  the  corresponding  procedure  would  perform  more  efficiently.  Thus,  the 
characteristics of the item pool to be used for mastery testing yield the first
point  at  which  it  might  be  decided  which  of  the  two  models  is  more  appropriate 
for  use  in  a  given  situation. 

Treatment  of  uncertainty.  The  SPRT  makes  use  of  traditional  hypothesis 
testing  methods  to  determine  the  point  at  which  an  individual's  item  responses 
are sufficient evidence for making a decision concerning his/her mastery status.
Here "sufficient" is defined in terms of the α and β error rates that one is
willing to accept across all the students tested. The values of α and β may be
set independently to reflect the educator's concerns over the relative costs of
the two error types.

The  AMT  procedure  uses  a  symmetric  Bayesian  confidence  interval  to  make  the 
mastery decision. This functionally sets α equal to β and, by doing so, implies
equal  costs  for  the  two  error  types.  To  the  extent  that  the  costs  of  the  two 
error  types  are  not  equal,  the  SPRT  provides  the  educator  with  more  flexibility 
than  the  AMT  procedure,  as  currently  operationalized. 

Treatment of mastery level. The SPRT uses an uncertainty region, rather
than  a  single  mastery  level,  to  define  the  mastery  and  nonmastery  regions.  The 
specification  of  this  uncertainty  region  is  based  on  a  decision  by  the  educator 
concerning  the  range  that  appropriately  reflects  uncertainty  as  to  whether  the 
student's  performance  is  actually  the  performance  of  a  master  or  a  nonmaster. 

By  contrast,  the  AMT  procedure  defines  a  single  mastery  level  and  determines 
whether  an  individual  is  significantly  above  or  below  the  mastery  level  using  a 
Bayesian  confidence  interval. 

This  difference  between  the  two  testing  procedures  renders  tentative  any 
comparison  that  might  be  made.  The  performance  of  the  SPRT  procedure  will  vary 
widely  as  a  function  of  the  uncertainty  band  chosen.  For  the  AMT  technique  this 
uncertainty  is  not  directly  taken  into  account.  Any  comparison  between  the  two 
techniques  is  conditional  upon  the  width  and  absolute  bounds  of  the  uncertainty 
region. 

Treatment of the θ metric. The decisions made by the SPRT are dependent on
the percentage of items that are correctly answered for any specific test
length. Thus, the metric of achievement assumed in this procedure is the
proportion-correct metric. The AMT procedure assumes, due to the differential
properties  of  the  items  in  the  item  pool,  that  there  is  a  nonlinear  transforma¬ 
tion  of  the  proportion-correct  metric,  which  more  accurately  represents  the 
achievement  of  the  individuals  taking  the  test.  This  latent  continuum  serves  as 
the  achievement  metric  for  the  AMT  procedure. 

This  difference  in  the  achievement  metric  again  renders  comparisons  between 
the  two  procedures  somewhat  difficult,  since  the  "true"  achievement  levels  of 
individuals  must  be  postulated  to  fit  one  of  these  metrics.  Any  differences 
noted  in  the  performance  of  the  two  procedures  may  be  due  to  this  difference  in 
the  achievement  metrics  assumed. 


EMPIRICAL  COMPARISON  OF  THE  SPRT  AND  AMT  PROCEDURES 

To delineate circumstances in which one of the mastery testing procedures
might have an advantage over the other, Monte Carlo simulation was used to
compare the two testing procedures under several conditions.

Method 


The  method  used  to  compare  the  two  variable-length  mastery  testing  proce¬ 
dures  to  one  another,  as  well  as  to  a  conventional  (fixed  length)  testing  proce¬ 
dure,  consisted  of  five  basic  steps: 

1.  Three  item  pools  were  generated  in  which  the  items  differed  from  one 
another  to  different  degrees. 

2.  Item  responses  were  generated  for  500  simulated  subjects  (simulees)  for 
each  of  the  items  in  the  three  item  pools. 

3. Conventional tests of three different lengths were drawn from the larger
item pools; these conventional tests served as item pools from which
the SPRT and AMT procedures drew items.

4.  The  AMT  and  SPRT  procedures  were  simulated  for  each  of  the  three  dif¬ 
ferent  item  pool  types  and  the  three  conventional  test  lengths. 

5.  Comparisons  were  drawn  among  the  three  types  of  tests  (AMT,  SPRT,  con¬ 
ventional)  concerning  the  degree  of  correspondence  between  the  deci¬ 
sions  made  by  the  three  test  types  and  the  true  mastery  status.  Fur¬ 
ther  comparisons  were  made  based  on  the  average  test  length  that  each 
test  type  required  to  reach  its  decisions. 

Item  Pool  Generation 

Three  100-item  pools  were  generated  to  reflect  different  types  of  pools 
that  might  be  used  in  a  mastery  test. 

Uniform  pool.  The  uniform  pool  consisted  of  100  items  that  were  perfect 
replications  of  one  another.  Each  item  had  the  same  discrimination  (a  =  1.00), 
difficulty  (b  =  0.00),  and  guessing  probability  (c  =  .20).  This  pool  was  de¬ 
signed  to  correspond  to  the  SPRT  procedure's  assumption  that  all  items  in  the 
test  are  similar. 

b-variable pool. The b-variable pool varied from the uniform pool only in
that the items had a range of difficulty levels. Eleven values of b were
assigned to an approximately equal number of items in the pool. The values of b
chosen were -2.50, -2.00, -1.50, -1.00, -0.50, 0.00, 0.50, 1.00, 1.50, 2.00, and
2.50. Nine items at each level of difficulty were used in this pool, along with
an additional item with b = 0.00 to bring the pool to 100 items.
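
The first two pools are specified completely enough to be rebuilt directly. The sketch below does so; the function names and the (a, b, c) tuple layout are hypothetical conventions chosen for illustration.

```python
def uniform_pool(n=100, a=1.00, b=0.00, c=0.20):
    """n identical items, each represented as an (a, b, c) tuple."""
    return [(a, b, c)] * n


def b_variable_pool(per_level=9, a=1.00, c=0.20):
    """Nine items at each of 11 difficulty levels, plus one extra item at b = 0."""
    levels = [-2.5 + 0.5 * k for k in range(11)]  # -2.50, -2.00, ..., 2.50
    pool = [(a, b, c) for b in levels for _ in range(per_level)]
    pool.append((a, 0.00, c))  # the additional item that brings the pool to 100
    return pool


uniform = uniform_pool()
b_variable = b_variable_pool()
```

Both pools contain 100 items, and the b-variable pool has ten items at b = 0.00 and nine at every other level.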

a-, b-, and c-variable pool. The a-, b-, and c-variable pool differed from
the b-variable pool in that the discriminations and guessing levels of the items
were allowed to spread across a range of values. The a values used were .50,
1.00, 1.50, and 2.00. The c values used were .10, .20, and .30. All a and c
values were approximately equally represented. The parameter estimates were
arranged such that each level of difficulty was represented by items that had
approximately the same average a level and the same average c level (i.e., the
pool was approximately rectangular).

Item  Response  Generation 

Achievement  levels  for  500  simulees  were  drawn  from  a  normal  distribution 
with  a  mean  of  zero  and  a  standard  deviation  of  one.  Item  responses  for  each  of 
these  simulees  were  then  generated  for  each  item  in  each  of  the  three  item  pools 
using the 3-parameter logistic ICC model. That is, knowing the θ level of the
simulee and the parameters of the item in question, the probability of a correct
response was calculated. A random number was then drawn from a uniform
distribution ranging from zero to one. If this number was lower than the probability
of  a  correct  response,  the  simulee  was  given  a  correct  response  to  the  item.  If 
the  number  was  higher  than  the  correct  response  probability,  the  simulee  was 
given  an  incorrect  response. 
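
A minimal sketch of this response generator follows. It assumes the logistic model is scaled with D = 1.7, which the text does not state; the function names, the seed, and the example pool of identical items are hypothetical.

```python
import math
import random

D = 1.7  # assumed scaling constant


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(thetas, items, rng):
    """Return one row of 0/1 responses per simulee, one column per item."""
    return [
        # A uniform draw below the response probability counts as correct.
        [1 if rng.random() < p_correct(theta, a, b, c) else 0 for a, b, c in items]
        for theta in thetas
    ]


rng = random.Random(1979)  # arbitrary seed, used only for reproducibility
thetas = [rng.gauss(0.0, 1.0) for _ in range(500)]  # achievement levels ~ N(0, 1)
items = [(1.0, 0.0, 0.2)] * 100
responses = simulate_responses(thetas, items, rng)
```

Each row of `responses` can then be read by any of the testing procedures, which is what allows the same simulated data to be reused across strategies.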

Thus, in this study, the achievement metric and the item response generator
correspond closely to the model assumed by the AMT procedure. The "true"
mastery level for each simulee was determined by comparing the θ levels used to
generate the item responses with the proportion-correct mastery level expressed
on the θ metric.

Conventional  Tests 


Conventional  tests  of  three  different  lengths  (10,  25,  and  50  items)  were 
drawn at random from each of the three item pools, with the stipulation that the
shortest  conventional  test  served  as  the  first  portion  of  the  next  longer  con¬ 
ventional  test  and  that  this  test  in  turn  served  as  the  first  portion  of  the 
longest  conventional  test.  These  nine  conventional  tests  served  as  subpools 
from  which  the  AMT  and  SPRT  procedures  drew  items  during  the  simulations.  This 
random  sampling  from  a  larger  domain  of  items  was  designed  to  correspond  to  the 
traditional  mastery  testing  paradigm  and  to  the  random  sampling  model  underlying 
the  SPRT. 
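
One simple way to obtain nested random tests of this kind is to draw the longest test at random and take its leading portions as the shorter tests. The sketch below does this; it is an assumption about the mechanics, since the text states only the nesting requirement, and the function name is hypothetical.

```python
import random


def nested_tests(pool_size=100, lengths=(10, 25, 50), rng=None):
    """Draw item indices at random so that each shorter test opens the next longer one."""
    rng = rng or random.Random()
    order = rng.sample(range(pool_size), max(lengths))  # sampling without replacement
    return {n: order[:n] for n in lengths}


tests = nested_tests(rng=random.Random(0))  # arbitrary seed
```

Applying the function once to each of the three pools gives the nine subpools used in the simulations.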

Simulation  of  the  Testing  Strategies 

Using  the  item  response  data  for  the  500  individuals  and  the  item  parame¬ 
ters  available  for  each  of  the  items  (for  the  AMT  procedure),  the  three  testing 
strategies  (AMT,  SPRT,  conventional)  were  employed  to  make  mastery  decisions  for 
each individual. Each testing procedure was used with each of the nine
subpools.

Conventional  test.  The  conventional  test  assumed  a  mastery  criterion  of 
60%  correct  responses.  After  all  of  the  items  in  the  conventional  test  were 
administered,  if  the  individual  answered  60%  or  more  items  correctly,  the  indi¬ 
vidual  was  declared  a  master.  If  the  individual's  score  was  less  than  60%  cor¬ 
rect,  the  individual  was  declared  a  nonmaster. 

SPRT procedure. For the SPRT procedure the limits of the uncertainty region
were set at proportion-correct values of .50 and .70. Values of α and β
were each set to .10. For individuals for whom no decision was made by the Wald
procedure  before  the  item  pool  was  exhausted,  the  mastery  decision  was  made  by 
the  conventional  procedure,  using  a  mastery  proportion  of  .60. 
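
With these settings the SPRT can be sketched in its standard binomial form, using Wald's approximate stopping bounds (1 - β)/α and β/(1 - α) on the likelihood ratio. The paper's own formulation appears earlier and may differ in detail; the function name and return convention here are hypothetical.

```python
import math


def sprt_decision(responses, p0=0.50, p1=0.70, alpha=0.10, beta=0.10, cutoff=0.60):
    """Wald's SPRT for H0: p = p0 (nonmaster) against H1: p = p1 (master).

    responses is a non-empty sequence of 0/1 item scores in administration
    order. Returns (decision, items_used).
    """
    if not responses:
        raise ValueError("at least one response is required")
    upper = math.log((1.0 - beta) / alpha)  # declare mastery at or above this bound
    lower = math.log(beta / (1.0 - alpha))  # declare nonmastery at or below this bound
    step_right = math.log(p1 / p0)
    step_wrong = math.log((1.0 - p1) / (1.0 - p0))
    llr = 0.0  # running log-likelihood ratio
    for n, x in enumerate(responses, start=1):
        llr += step_right if x else step_wrong
        if llr >= upper:
            return "master", n
        if llr <= lower:
            return "nonmaster", n
    # Pool exhausted without a decision: fall back on the conventional rule.
    n = len(responses)
    proportion = sum(responses) / n
    return ("master" if proportion >= cutoff else "nonmaster"), n
```

Under these values a simulee who answers every item correctly is declared a master after seven items, and one who misses every item is declared a nonmaster after five, because an incorrect response moves the log ratio further (log .6) than a correct response does (log 1.4).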

AMT. For the AMT procedure the mastery levels in each of the 100-item
pools corresponding to 60% correct were designed to be equal to θ = 0.00. This
mastery level was used with each of the smaller item pools, even though they had
not been designed to result in a mastery level of θ = 0.00. This procedure
added some sampling error to the AMT procedure, to more appropriately reflect
the error that is inherent when using estimated item parameters to determine the
mastery level. For the AMT Bayesian scoring procedure, each individual was
assumed to have a prior mean of 0.00 and a prior variance of 1.00.
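
The overall flow of an AMT administration under these settings can be sketched as follows. The sketch is not the authors' implementation: it replaces Owen's closed-form updating with a numerical posterior on a grid of θ values, it assumes a logistic ICC with D = 1.7, and it selects the unused item with the greatest Fisher information at the current estimate as a stand-in for the MISS rule. All function names are hypothetical.

```python
import math

D = 1.7  # assumed scaling constant
GRID = [-4.0 + 0.05 * k for k in range(161)]  # theta grid from -4 to 4


def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Fisher information of a three-parameter logistic item at theta."""
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p


def posterior_moments(weights):
    """Mean and variance of the (unnormalized) grid posterior."""
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, GRID)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, GRID)) / total
    return mean, var


def amt(items, answer, theta_m=0.0, prior_mean=0.0, prior_var=1.0, z=1.96):
    """Test until the interval excludes theta_m or the pool is exhausted.

    items is a list of (a, b, c) tuples; answer maps an item index to 1 or 0.
    Returns (decision, items_used, theta_hat).
    """
    weights = [math.exp(-((t - prior_mean) ** 2) / (2.0 * prior_var)) for t in GRID]
    theta_hat, var = posterior_moments(weights)
    unused = set(range(len(items)))
    used = 0
    while unused:
        # Stand-in selection rule: most informative unused item at theta_hat.
        j = max(unused, key=lambda k: item_information(theta_hat, *items[k]))
        unused.remove(j)
        x = answer(j)
        used += 1
        # Multiply the posterior by the likelihood of the observed response.
        weights = [
            w * (p_correct(t, *items[j]) if x else 1.0 - p_correct(t, *items[j]))
            for w, t in zip(weights, GRID)
        ]
        theta_hat, var = posterior_moments(weights)
        half = z * math.sqrt(var)
        if theta_hat - half > theta_m:
            return "master", used, theta_hat
        if theta_hat + half < theta_m:
            return "nonmaster", used, theta_hat
    # Pool exhausted: decide on the point estimate (ties count as mastery here).
    return ("master" if theta_hat >= theta_m else "nonmaster"), used, theta_hat
```

Passing a function that looks up a simulee's stored response for item j as `answer` would let the same routine be run over previously generated response data.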

Comparison  among  Testing  Procedures 


For each of the three testing procedures (AMT, SPRT, conventional), the
value of the procedure may be judged by the average length of the test required
to make the mastery decision and by how well the decisions that are made reflect
the true state of nature. Specifically, the AMT and SPRT procedures were
compared in terms of the average reduction in the length of the test required to
make mastery decisions across the entire group of individuals. Further, all
three procedures were compared in terms of how well the decisions they made
corresponded with the true mastery status of the individuals.

Comparisons  within  each  testing  procedure  concerning  the  average  test 
length  and  the  correspondence  of  decisions  with  true  mastery  status  were  made 
across  all  nine  combinations  of  test  lengths  and  item  pool  types. 


RESULTS 


Test  Length 

Table  1  shows  the  mean  test  length  required  by  each  of  the  testing  proce¬ 
dures  to  make  a  decision  concerning  the  mastery  status  of  the  simulees  in  the 
test  group. 


Table 1

Mean Number of Items Administered to Each Simulee
for Three Mastery Testing Strategies Using Each Type
of Item Pool, at Three Maximum Test Lengths

Item Pool and                      Maximum Test Length
Testing Strategy                  10        25        50

Uniform Pool
  Conventional                 10.00     25.00     50.00
  AMT                           9.03     15.99     23.00
  SPRT                          8.75     13.12     15.39

b-Variable Pool
  Conventional                 10.00     25.00     50.00
  AMT                           9.43     18.09     27.17
  SPRT                          9.62     16.79     21.41

a-, b-, and c-Variable Pool
  Conventional                 10.00     25.00     50.00
  AMT                           8.73     16.35     23.39
  SPRT                          8.62     13.42     15.70

Uniform  pool.  As  can  be  seen  from  Table  1,  the  AMT  procedure  resulted  in 
some  test  length  reduction  for  each  maximum  test  length  (MTL),  with  the  reduc¬ 
tion  in  test  length  increasing  as  the  MTL  increased.  For  the  10-item  MTL,  the 
percentage  by  which  the  conventional  test  length  was  reduced  was  9.7%;  for  the 
25-item  MTL  the  reduction  was  36%;  and  for  the  50-item  MTL  the  observed  reduc¬ 
tion  was  54%. 


For the SPRT procedure, again, increasing test length reduction was noted
as MTL increased; and some reduction was noted at each level of MTL. For the
10-item MTL, the reduction observed was 12%. The 25-item MTL resulted in a 48%
reduction. For the 50-item MTL the reduction was 69%. At all MTL levels the
SPRT procedure resulted in a greater reduction of test length than the AMT
procedure.

b-variable  pool.  For  the  pool  in  which  the  difficulty  levels  of  the  items 
differed,  the  data  in  Table  1  show  the  same  trends  that  were  noted  for  the  uni¬ 
form  pool.  The  AMT  procedure  reduced  the  test  length  at  each  MTL,  and  the  re¬ 
duction  increased  with  the  MTL  level.  For  the  10-item,  25-item,  and  50-item  MTL 
levels,  the  AMT  procedure  reduced  test  length  by  6%,  28%,  and  46%,  respectively. 

The  SPRT  procedure  also  reduced  test  length  at  each  MTL  level,  with  larger 
reductions  for  the  longer  MTL  levels.  At  the  10-item,  25-item,  and  50-item  MTL 
levels  the  test  length  reductions  observed  were  4%,  33%,  and  57%,  respectively. 

For  this  pool  the  AMT  procedure  resulted  in  slightly  greater  reduction  in 
test  length  at  the  10-item  MTL  level,  whereas  the  SPRT  procedure  resulted  in 
greater  test  length  reductions  for  the  longer  MTL  levels.  Across  all  MTL  lev¬ 
els,  both  procedures  reduced  test  length  somewhat  less  for  this  item  pool  than 
for  the  uniform  item  pool. 

a-, b-, and c-variable pool. Table 1 shows that when the AMT procedure was
used  with  this  item  pool,  test  length  was  again  reduced  at  each  MTL  and  this 
reduction  was  greater  for  the  longer  MTL  levels.  For  the  10-item,  25-item,  and 
50-item  MTL  levels,  the  observed  reductions  in  test  length  were  13%,  35%,  and 
53%,  respectively. 

For  the  SPRT  procedure  with  this  item  pool,  test  length  reduction  was  once 
more  observed,  with  an  increasing  reduction  as  the  MTL  increased.  The  reduc¬ 
tions  noted  were  14%,  46%,  and  69%  for  the  10-item,  25-item,  and  50-item  MTL 
levels. 

For  this  item  pool  the  SPRT  procedure  terminated  using  a  smaller  average 
number  of  items  for  each  MTL.  Further,  the  degree  of  test  length  reduction  in 
this  pool  for  both  procedures,  at  all  MTL  levels,  was  quite  similar  to  that  ob¬ 
served  for  the  uniform  item  pool. 
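
The percentage reductions quoted in this section follow from the Table 1 means as 100(MTL - mean length)/MTL. The short sketch below recomputes them; the dictionary layout and function name are hypothetical, and the means are those reported in Table 1.

```python
MTL = (10, 25, 50)

# Mean numbers of items administered, from Table 1.
mean_lengths = {
    ("uniform", "AMT"): (9.03, 15.99, 23.00),
    ("uniform", "SPRT"): (8.75, 13.12, 15.39),
    ("b-variable", "AMT"): (9.43, 18.09, 27.17),
    ("b-variable", "SPRT"): (9.62, 16.79, 21.41),
    ("abc-variable", "AMT"): (8.73, 16.35, 23.39),
    ("abc-variable", "SPRT"): (8.62, 13.42, 15.70),
}


def percent_reduction(mean_length, mtl):
    """Reduction relative to a conventional test of length mtl, in percent."""
    return 100.0 * (mtl - mean_length) / mtl


reductions = {
    key: tuple(percent_reduction(m, n) for m, n in zip(means, MTL))
    for key, means in mean_lengths.items()
}
```

For instance, the AMT mean of 15.99 items at the 25-item MTL in the uniform pool gives 100(25 - 15.99)/25 = 36.04, reported in the text as 36%.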

Correspondence  with  True  Mastery  Status 

For each of the simulees in the sample, the true θ level was known: It was
the level that was used to generate the item responses. Given this, it was
known whether the individual's θ level was actually above or below the
prespecified mastery level on the achievement metric (θ = 0.00). Phi
correlations between true mastery status and the mastery state determined by
each of the three testing procedures for each MTL level and pool type are shown
in Table 2.
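
The phi correlation between two dichotomous classifications can be computed from the four cell counts of their 2 x 2 table. A minimal sketch follows; the function name and the 1/0 coding of master/nonmaster are hypothetical conventions.

```python
import math


def phi_coefficient(true_status, decided_status):
    """Phi correlation between two equal-length sequences of 1/0 codes."""
    n11 = n10 = n01 = n00 = 0
    for t, d in zip(true_status, decided_status):
        if t and d:
            n11 += 1  # true master, declared master
        elif t and not d:
            n10 += 1  # true master, declared nonmaster
        elif d:
            n01 += 1  # true nonmaster, declared master
        else:
            n00 += 1  # true nonmaster, declared nonmaster
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (n11 * n00 - n10 * n01) / denom
```

Perfect agreement yields a coefficient of 1, and decisions unrelated to true status yield values near 0.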

Uniform  pool.  For  the  uniform  pool  one  major  trend  was  observed.  For  each 
testing  procedure  an  increase  in  the  MTL  level  was  accompanied  by  an  increase  in 
the  correlation  between  the  true  and  estimated  mastery  states.  (These  correla¬ 
tions  may  be  referred  to  as  correspondence  coefficients.) 

Table 2

Phi Correlations Between Observed Mastery
State and True Mastery State for Each Mastery
Testing Strategy, Using Each Type of Item Pool,
at Three Maximum Test Lengths

Item Pool and                      Maximum Test Length
Testing Strategy                  10        25        50

Uniform Pool
  Conventional                  .771      .837      .875
  AMT                           .775      .840      .871
  SPRT                          .771      .837      .867

b-Variable Pool
  Conventional                  .541      .667      .783
  AMT                           .615      .715      .828
  SPRT                          .541      .656      .704

a-, b-, and c-Variable Pool
  Conventional                  .290      .670      .735
  AMT                           .470      .733      .787
  SPRT                          .290      .592      .571

In addition to this major trend, it was observed that for the 10-item and
25-item MTL levels, the AMT procedure produced the highest correspondence
coefficient observed (r = .775 and .840, respectively). For the 50-item MTL
level the conventional procedure resulted in the highest correspondence
(r = .875).

It  should  be  noted  that  the  differences  in  correspondence  between  any  two 
MTL  levels  within  any  testing  procedure  (the  smallest  was  .03,  between  the 
25-item  and  50-item  MTL  levels  for  the  SPRT  procedure)  were  much  larger  than  the 
largest  difference  noted  between  any  two  testing  procedures  within  a  single  MTL 
level  (.008,  for  the  conventional  and  SPRT  procedures  in  the  50-item  MTL  level). 

b-variable  pool.  The  same  major  trend  that  was  found  for  the  uniform  pool 
was  again  observed  in  the  b-variable  pool.  Each  testing  strategy  resulted  in 
higher  correspondence  as  the  MTL  level  increased.  For  each  MTL  level,  the  AMT 
procedure  resulted  in  the  highest  correspondence  coefficients.  The  conventional 
procedure  resulted  in  the  next  highest  correspondence  level  for  all  three  MTL 
levels  (tied  with  the  SPRT  procedure  at  the  10-item  MTL  level). 

Differences  in  correspondence  coefficients  observed  between  testing  strate¬ 
gies  within  an  MTL  level  were  larger  in  this  pool  than  in  the  uniform  pool  but 
were  still  somewhat  smaller  than  the  differences  noted  between  MTL  levels,  on 
the average. It was also noted that each correspondence level observed was
lower for this pool than for the uniform pool across all MTL levels and testing
procedures.

a-, b-, and c-variable pool. The same trend of increasing correspondence
with increasing MTL level was again noted for the conventional and AMT
procedures. For the SPRT procedure the correspondence peaked at r = .592 at the
25-item MTL level and dropped to .571 at the 50-item MTL level.

The  AMT  procedure  produced  the  highest  correspondence  for  all  three  MTL 
levels.  The  conventional  procedure  resulted  in  the  next  highest  level  of  per¬ 
formance  at  all  MTL  levels  (again  tied  with  the  SPRT  procedure  at  the  10-item 
MTL  level). 

Once  again,  the  average  difference  in  correspondence  was  much  greater  be¬ 
tween  MTL  levels  within  testing  strategies  than  between  two  testing  strategies 
within  a  single  MTL  level.  Further,  on  the  average,  the  correspondence  coeffi¬ 
cients  for  this  pool  were  lower  than  for  either  of  the  other  pools,  with  rather 
large  decreases  at  the  10-item  MTL  level,  particularly  for  the  conventional  and 
SPRT  strategies. 

Frequency  and  Type  of  Errors 

To further compare the performance of the three mastery testing strategies,
the frequency with which each procedure made incorrect decisions (false mastery,
false nonmastery) was examined; the percentage of decision errors of each type
made by each of the testing strategies with each of the item pools at each MTL
is shown in Table 3. It may be noted that the "Total" column in Table 3
reproduces the
information  already  reported  from  the  correlational  analysis,  but  in  a  different 
manner.  For  each  situation  in  which  a  high  correlation  was  noted,  a  correspond¬ 
ingly  low  total  error  rate  is  noted  in  Table  3,  as  expected. 
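
The two error types can be tallied from the same pair of classifications used for the correlations. In the sketch below the rates are expressed as percentages of all simulees, which is an assumption about how Table 3 is normalized; the function name and 1/0 coding are hypothetical.

```python
def error_rates(true_status, decided_status):
    """Percentages of false mastery, false nonmastery, and total decision errors."""
    n = len(true_status)
    pairs = list(zip(true_status, decided_status))
    false_mastery = sum(1 for t, d in pairs if d and not t)     # nonmaster passed
    false_nonmastery = sum(1 for t, d in pairs if t and not d)  # master failed
    return {
        "false_mastery": 100.0 * false_mastery / n,
        "false_nonmastery": 100.0 * false_nonmastery / n,
        "total": 100.0 * (false_mastery + false_nonmastery) / n,
    }


# Toy data: three true masters and two true nonmasters.
rates = error_rates([1, 1, 1, 0, 0], [1, 0, 0, 1, 0])
```

In the toy data one nonmaster is passed and two masters are failed, giving rates of 20%, 40%, and 60% in total.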

Uniform  pool.  For  the  uniform  pool  each  of  the  testing  strategies  resulted 
in  the  same  general  pattern  of  errors  across  MTL  levels.  Each  procedure  result¬ 
ed  in  more  false  nonmastery  decisions  than  false  mastery  decisions  at  all  MTL 
levels.  Each  procedure  also  resulted  in  fewer  errors  of  each  type  with  in¬ 
creased  MTL.  The  difference  in  the  frequencies  of  false  mastery  and  false  non¬ 
mastery  decisions  was  smaller  with  larger  MTL  levels  for  all  procedures.  The 
differences  among  the  procedures  in  terms  of  the  types  of  false  decisions  made 
were  minimal. 

b-variable  pool.  For  this  item  pool  the  patterns  of  errors  made  by  the 
different  testing  strategies  were  less  regular  than  in  the  uniform  pool.  The 
conventional  and  SPRT  procedures  produced  more  false  mastery  than  false  nonmas¬ 
tery  decisions  at  all  MTL  levels.  The  AMT  procedure  produced  more  false  mastery 
than  false  nonmastery  decisions  at  the  10-item  MTL  level  but  produced  more  false 
nonmastery  than  false  mastery  decisions  at  the  two  higher  MTL  levels.  For  the 
AMT  procedure  the  discrepancy  in  the  frequencies  of  the  two  types  of  errors  was 
smaller  than  for  the  other  two  procedures  at  all  three  MTL  levels  and  was  quite 
small  (less  than  2%)  at  the  two  higher  MTL  levels.  For  the  conventional  proce¬ 
dure  the  difference  in  the  frequencies  of  the  two  types  of  errors  was  quite 
small  at  the  highest  MTL  level;  but  for  the  SPRT  procedure,  a  fairly  large  dis¬ 
crepancy  between  the  two  error  rates  (20%  to  80%)  was  observed  at  each  MTL  lev¬ 
el. 

In all testing conditions but one (AMT with a 25-item MTL), the use of the
b-variable item pool resulted in higher discrepancies between the two observed
error rates (as well as higher absolute error rates) than when the uniform pool
was used.


a-, b-, and c-variable pool. For this item pool, each of the testing
procedures resulted in higher frequencies of false nonmastery decisions than false
mastery  decisions  for  the  10-item  and  25-item  MTL  levels.  For  the  50-item  MTL 
level  the  conventional  procedure  resulted  in  a  higher  frequency  of  false  mastery 
decisions,  but  the  AMT  and  SPRT  procedures  still  resulted  in  higher  percentages 
of  false  nonmastery  decisions.  As  with  the  b-variable  item  pool,  the  AMT  proce¬ 
dure  used  with  this  item  pool  resulted  in  smaller  differences  in  the  frequencies 
of  the  two  error  types  than  either  of  the  other  testing  procedures  at  each  MTL 
level.  For  the  50-item  MTL  level  the  AMT  procedure  produced  a  very  small  dif¬ 
ference  in  the  two  error  rates  (.6%).  The  conventional  procedure  also  produced 
a  small  difference  in  the  two  error  rates  for  the  50-item  MTL  level  (1.6%).  The 
SPRT  procedure  resulted  in  the  highest  difference  between  the  two  error  rates  at 
all  MTL  levels  (tied  with  the  conventional  procedure  at  the  10-item  MTL  level). 

One interesting result was observed when the errors made with the b-variable
item pool were compared with those made using the a-, b-, and c-variable
item pool. For the b-variable pool each of the testing procedures was more
likely to make false mastery decisions than false nonmastery decisions. This
tendency was reversed for the a-, b-, and c-variable item pool, where each of
the procedures made more false nonmastery decisions than false mastery
decisions. These trends were most noticeable for each of the testing procedures
at the 10-item MTL level, and most noticeable for the SPRT procedure across all
MTL levels. It is probable that these trends were artifacts of the random
sampling of items used to create the conventional tests, since the shorter
conventional tests would be less representative of the item domain due to the
small sample of items taken. The results obtained here would be explained by a
very easy 10-item conventional test being drawn from the b-variable pool and a
very difficult 10-item test being drawn from the a-, b-, and c-variable pool.
In fact, the mean b value for the 10-item conventional test drawn from the
b-variable pool was -.80; for the a-, b-, and c-variable pool, it was 1.25.
This would also explain the observation that the SPRT procedure most clearly
showed these trends, since the SPRT procedure used shorter test lengths, on the
average, than the other two procedures to make its final decisions and
therefore was most prone to small-sample artifacts.


DISCUSSION  AND  CONCLUSIONS 

Several  trends  were  noted  in  the  data  concerning  the  performance  of  the 
three  testing  strategies  in  the  three  different  item  pools.  In  every  instance 
the  AMT  and  SPRT  procedures  produced  reductions  in  the  mean  test  length  required 
to  make  mastery  decisions.  This  reduction  increased  with  the  MTL  level  in  each 
circumstance.  The  AMT  procedure  resulted  in  reductions  of  6%  to  54%  from  the 
length  of  the  conventional  test.  The  SPRT  procedure  resulted  in  reductions  of 
4%  to  69%.  On  the  average,  the  SPRT  procedure  required  fewer  items  to  make  the 
mastery  decision. 
The correspondence between the estimated mastery status and the true mastery
status systematically increased with MTL for all testing procedures in each
item  pool.  The  correspondence  fairly  systematically  decreased  from  the  uniform 
pool,  to  the  b-variable  pool,  to  the  a-,  b-,  and  c-variable  pool.  The  AMT  pro¬ 
cedure  resulted  in  the  highest  level  of  correspondence  in  all  circumstances  but 
one  (the  conventional  test  performed  best  for  the  50-item  MTL  with  the  uniform 
pool).  On  the  average,  though,  the  differences  between  different  MTL  levels 
were  more  pronounced  than  differences  between  testing  procedures.  Further,  the 
type  of  item  pool  used  had  pronounced  effects  on  the  correspondence  obtained. 

The  AMT  procedure  resulted  in  the  most  even  frequencies  in  the  types  of 
decision  errors  made  across  most  MTL  levels  and  item  pools.  This  was  desirable, 
since  both  error  types  were  assumed  to  have  the  same  relative  cost.  Further,  it 
was  noted  that  the  SPRT  procedure  was  most  susceptible  to  small-sample  arti¬ 
facts,  resulting  in  an  imbalance  in  the  frequencies  with  which  the  two  types  of 
errors  were  made. 

To  prescribe  the  best  testing  strategy  of  those  described  here  requires 
specification  of  priorities  and  conditionals.  If  a  uniform  item  pool  is  as¬ 
sumed,  the  SPRT  procedure  required  the  fewest  items  while  resulting  in  decisions 
having  correspondence  coefficients  that  were  quite  comparable  to  the  other  two 
procedures.  If,  however,  the  item  pool  includes  items  with  variable  a,  b,  and  c 
parameters,  the  SPRT  procedure  may  result  in  the  shortest  tests,  but  the  AMT 
procedure  will  make  more  accurate  classifications.  These  factors  must  be  con¬ 
sidered  before  any  decision  is  made  as  to  which  procedure  is  "best." 

It  should  also  be  noted  that  this  simulation  was  based  on  the  assumption 
that the latent achievement metric, rather than the proportion-correct metric,
was  the  correct  metric;  and  to  the  extent  that  the  proportion-correct  metric  is 
the  correct  metric,  the  findings  of  this  study  are  less  relevant.  In  addition, 
several  variations  on  the  SPRT  procedure  and  the  AMT  procedure  that  were  not 
examined  in  this  study  are  possible;  thus,  additional  research  is  necessary  be¬ 
fore  firm  conclusions  can  be  drawn  concerning  the  utility  of  adaptive  mastery 
testing  strategies. 
Discussion: Session 3


Melvin  Novick 
University  of  Iowa 


I  shall  discuss  some  general  methodological  issues  that  bear  on  the  papers 
by  Reckase,  Kalisch,  and  Kingsbury  and  Weiss  and  also  on  previous  papers  presen¬ 
ted,  integrating  into  the  discussion  relevant  points  that  have  been  made  by 
Lord,  Wainer,  Samejima,  Lumsden,  and  others. 

The results that have been obtained in these papers are contradictory.
There seems to be difficulty deciding whether or not adaptive testing is
worthwhile with a Bayesian approach, a difficulty that is related to the kinds
of models that have been adopted and the kinds of statistical analysis that are
being performed. Lord made an important comment about the metric in which a
least squares analysis is performed; and although the suggestion he made in that
context was very good, it opens up the question, which is the correct metric?
Wainer's comments about robustness are also very important; indeed, some of the
problems that we have had have resulted from allowing a few outliers to mar the
analyses. An important part of my discussion will also bear on his comment,
"Let's look at the ends, because it doesn't matter what's going on in the
middle." Samejima's comments about dimensionality are crucial; and Lumsden's
comment about the importance of choosing the statistical analysis for the
particular decision at hand is central to my discussion.

I am absolutely delighted to see that everyone is using Bayesian methods:
It is a dream come true. The realization of the dream, however, remains
imprecise. Although Bayesian procedures are being used, the analyses are not all
Bayesian, which is part of the problem I hope to correct.

A brief discussion is in order about the development of pre-Bayes statistical
theory and its application in a Bayesian decision theoretic context. First
was Gauss's work on least squares, which led to a certain mean value as an
estimate; this was followed by the Gauss-Markov theorem, which tied least
squares with the normal distribution. At about the same time, Laplace was
working with absolute error loss, which is typically a better loss function than
squared error, and in my judgment Laplace deserves more credit than he has been
given. Once the question of how to obtain an estimator has been posed and
considered in terms of the appropriate loss function, a whole new set of
problems arises.

Even  though  absolute  error  loss  may  be  better  than  squared-error  loss,  in 
some  of  the  applications  this  places  too  much  weight  on  those  large  discrepan¬ 
cies.  In  terms  of  mastery  testing,  for  example,  it  does  not  matter  if  the  per¬ 
son  is  three  standard  deviations  from  the  criterion  or  four.  Certainly,  as 
Wainer  has  said,  we  do  not  want  the  analysis  to  be  affected  very  much  by  that, 
particularly when it is recognized that the distributions are not normal but
that  there  are  all  kinds  of  outliers  and  unusual  data  values.  This  is  not  a 
minor  point.  It  affects  all  the  analyses  that  are  being  done.  A  very  careful 
look  must  be  taken  at  the  loss  function  in  deciding  whether  the  decision  rule  or 
even  an  estimator  is  any  good.  In  my  judgment,  none  of  the  loss  functions  that 
have  been  talked  about  at  this  conference  are  acceptable. 

I  did  use  threshold  loss  in  papers  published  several  years  ago;  but  at  that 
time,  Bayesian  methods  with  threshold  loss  were  better  than  classical  methods. 
Now  there  are  better  methods,  and  recent  papers  discussing  more  general  loss  or 
utility  functions  provide  much  more  acceptable  methods.  In  these  papers  the 
normal  ogive  is  used  as  a  utility  function.  This  is  a  clear  improvement  over 
threshold  utility.  However,  there  is  a  Stage  3  in  which  a  cumulative  data  dis¬ 
tribution  may  be  used — perhaps  some  other  ogival  forms — as  a  utility  function. 

One  of  the  techniques  that  was  used  in  a  paper  in  an  earlier  session  was  to 
ascertain  how  adaptive  testing  improves  reliability  and  squared-error  loss.  The 
difference  between  looking  at  a  reliability  and  looking  at  a  squared-error  loss 
is  that  reliability  forgets  about  any  kind  of  bias.  However,  squared-error  is 
actually  irrelevant  to  a  context  in  which  a  mastery  decision  or  a  selection  is 
being  made.  This  does  not  mean  that  I  repudiate  either  classical  test  theory, 
which  is  built  largely  on  mean-squared-error,  or  latent  trait  theory.  Those 
methods  are  useful  in  certain  contexts,  e.g.,  when  developing  a  test  that  is 
going to be used for a wide range of purposes, when interest is in
discrimination across the whole range of ability and some overall measure is needed, and
for  the  SAT  and  ACT  tests.  These  methods  are  much  less  useful  in  the  context  in 
which  there  is  a  question  of  mastery  or  selection  and  one  has  a  fair  idea  where 
that  selection  is  going  to  be.  Then,  it  is  desirable  to  peak  the  tests  at 
roughly  that  point,  but  it  is  also  desirable  to  use  a  loss  function,  or  better 
yet, a utility function that focuses on that point. Therefore, looking at
questions of reliability and squared-error loss does not really address the
question of the efficacy of the procedure in any real way.

On  a  related  issue,  there  has  been  discussion  on  using  Bayesian  modal  esti¬ 
mates  or  maximum  likelihood  estimates,  which  are,  of  course,  also  Bayesian  modal 
estimates  assuming  a  uniform  prior  distribution  on  a  particular  parameteriza¬ 
tion.  These  are  appropriate  only  in  terms  of  a  zero-one  loss  function,  a  most 
unrealistic  loss  function  in  this  context.  Therefore,  the  analyses  based  on 
maximum  likelihood  or  a  Bayesian  modal  estimator  may  be  unrealistic. 
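
The link between loss function and estimator can be made concrete with a small sketch. It relies on the standard results that the posterior mode is the Bayes estimate under zero-one loss, the posterior median under absolute-error loss, and the posterior mean under squared-error loss. The five-point posterior below is purely hypothetical and is not taken from any of the papers discussed.

```python
def bayes_estimates(grid, probs):
    """Return (mode, median, mean) of a discrete posterior distribution.

    mode   -> Bayes estimate under zero-one loss
    median -> Bayes estimate under absolute-error loss
    mean   -> Bayes estimate under squared-error loss
    """
    total = sum(probs)
    probs = [p / total for p in probs]          # normalize, just in case
    mode = grid[max(range(len(grid)), key=lambda i: probs[i])]
    cumulative = 0.0
    median = grid[-1]
    for value, p in zip(grid, probs):
        cumulative += p
        if cumulative >= 0.5:                   # first point reaching 50%
            median = value
            break
    mean = sum(value * p for value, p in zip(grid, probs))
    return mode, median, mean


# Hypothetical skewed posterior over five ability values.
grid = [-1.0, 0.0, 1.0, 2.0, 3.0]
posterior = [0.30, 0.25, 0.20, 0.15, 0.10]
mode, median, mean = bayes_estimates(grid, posterior)
print(mode, median, round(mean, 3))
```

With a skewed posterior the three estimates differ, so the choice of estimator is implicitly a choice of loss function.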

The  dimensionality  issue  is  crucial.  Something  like  the  reliability- 
validity  paradox  may,  in  fact,  be  occurring  here,  as  Samejima  suggests.  It 
would not surprise me at all if we are dealing with a test that is
multidimensional and a criterion that is almost certainly multidimensional. If this is
true,  and  if  a  dominant  trait  is  focused  on,  we  may  be  building  up  reliability 
and  not  measuring  the  other  traits  that  are  essential  in  prediction.  Thus,  va¬ 
lidity  will  suffer.  The  answer  to  this  is  probably  to  study  the  predictor  and 
the  criterion  carefully  and  to  define  the  factors  or  traits  and  see  that  each 
one  is  measured  carefully. 


Next,  if  least  squares  is  to  be  used,  which  I  do  not  really  advocate,  there 
is the question of what metric to do it in. Should it be done in a latent
variable metric? This causes problems because computations sometimes do not
converge. Should it be done in the true score metric, which is tighter? Although
I do not know the answer to that question at present, the question should not be
ignored.

The  questions  that  need  to  be  considered  are  (1)  How  much  efficiency  is 
being  obtained?  (2)  Where  is  the  efficiency  being  sought?  and  (3)  What  is  the 
appropriate  measure  of  efficiency?  If  a  procedure  is  being  designed  to  assist 
in  selection  near  the  top  of  the  distribution  or  at  some  other  criterion  point, 
it  really  does  not  matter  whether  or  not  better  estimation  is  being  obtained 
away  from  that  point.  It  is  totally  irrelevant  to  state  that  there  is  only  a  3% 
increase  in  efficiency  overall.  A  50%  increase  could  be  obtained  where  needed, 
still  averaging  out  to  5%  overall.  That  would  not  be  bad.  This  is  a  question 
of  how  the  gains  are  computed,  which,  again,  may  be  related  to  the  question  of 
robustness.  We  may  be  doing  terrible  things  with  some  outlier;  but  if  a  testee 
is  completely  off  the  scale,  perhaps  it  does  not  really  matter,  because  a  large 
error  will  not  affect  the  decision. 

The Owen procedure is a good Bayesian procedure, although it does make some
assumptions. Some of the assumptions that it makes may not be terribly well
satisfied for the first half-dozen items, and improvements are possible,
but that is not important. If any reasonable Bayesian procedure is used, a
great  deal  will  be  gained  from  the  Bayesian  allocation.  If  a  person  is  seated 
at  a  terminal,  it  may  not  be  very  significant  whether  he/she  takes  5  items  or  6 
items.  Thus,  I  am  not  so  sure  that  the  emphasis  on  variable  stopping  is  impor¬ 
tant.  Some  rules  could  probably  be  worked  out  that,  by  and  large,  would  provide 
good  results  if  all  testees  were  given  a  Bayesian  allocated  test  of  specified 
length  and  the  decision  were  made  at  that  point.  The  advantage  would  be  that 
most  of  the  inaccuracies  in  the  approximations  of  the  Owen  procedure  would  be 
eliminated. If, indeed, the saving of one item, on the average, has a high
payoff, then presumably someone would be willing to make a large investment to
obtain the needed refinements.

Now,  I  should  like  to  treat  some  specifics  of  the  Reckase  and  Kingsbury  and 
Weiss papers. In each paper there is an emphasis on the Wald Sequential
Probability Ratio Test (SPRT). The original application of this method was that
there  was  a  production  process  in  control  with  a  certain  error  rate  that  was 
tolerable.  The  concern  was  that  something  had  happened  that  seriously  degraded 
production  quality  and  it  was  desirable  to  identify  the  problem  very  quickly. 
Therefore,  it  was  very  reasonable  to  take  a  certain  point  hypothesis,  a  3%  error 
rate,  with  the  recognition  that  if  the  process  was  not  in  proper  working  order, 
that  error  rate  was  going  to  go  up  to  10%  and  therefore  the  alternative  hypothe¬ 
sis  of  10%  should  be  used.  That  paradigm  is  not  correct  in  the  context  of  adap¬ 
tive  testing.  What  the  SPRT  formulation  gives  is  utility  functions  with  three 
levels  corresponding  to  the  false  positive,  false  negative,  and  indifference 
zones.  In  fact,  an  appropriate  utility  function  would  be  continuous  and  not 
abrupt  in  change  of  magnitude  of  the  first  derivative  (see  Novick  &  Lindley, 
1978).  This  is  very  important  because  it  has  a  very  substantial  effect  on  the 
analysis,  both  in  terms  of  the  number  of  observations  needed  and  in  terms  of  the 
decision  rule  to  be  adopted. 
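
The quality-control setting described above can be sketched as follows. This is a minimal illustration of Wald's SPRT for a Bernoulli error rate, using the 3% and 10% point hypotheses mentioned in the text; the nominal error probabilities `alpha` and `beta` (both .05) and the function name are assumptions made for the example.

```python
import math


def sprt(observations, p0=0.03, p1=0.10, alpha=0.05, beta=0.05):
    """Wald SPRT for a Bernoulli error rate: H0: p = p0 versus H1: p = p1.

    observations: iterable of 0/1 values (1 = defective unit).
    Returns (decision, n_used), where decision is "H0", "H1", or
    "continue" if the data run out before a boundary is crossed.
    """
    upper = math.log((1 - beta) / alpha)    # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))    # accept H0 at or below this
    log_ratio = 0.0
    n = 0
    for n, defective in enumerate(observations, start=1):
        if defective:
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1 - p1) / (1 - p0))
        if log_ratio >= upper:
            return "H1", n
        if log_ratio <= lower:
            return "H0", n
    return "continue", n


print(sprt([1, 1, 1]))      # a run of defectives signals trouble quickly
print(sprt([0] * 100))      # a long run of good units accepts H0
```

The two point hypotheses and the indifference region between them are what give the procedure its three-level structure, which is the feature criticized in the text.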
A minor technical point is that one simply cannot look ahead one step,
computing the cost of taking an observation and comparing this with the expected
gain,  and  then  stopping  when  it  is  discovered  that  the  expected  gain  does  not 
exceed  the  cost  of  the  observation.  In  fact,  all  possible  sample  sizes  would 
have  to  be  investigated  to  make  sure  that  none  yielded  an  expected  gain.  I  am 
not,  however,  arguing  for  this  complication;  indeed,  I  am  arguing  for  a  simpli¬ 
fication  to  fixed  sample  sizes. 

Finally,  although  there  are  a  half  a  dozen  other  examples  within  an  epsilon 
of  the  one  I  selected  to  discuss,  Kingsbury  and  Weiss's  paper  presents  the  most 
simple  and  striking  example  of  doing  Bayesian  analysis  without  a  saturation  of 
understanding  Bayesian  theory.  The  idea  of  looking  at  the  Bayesian  confidence 
interval,  or  as  I  would  prefer  to  call  it,  the  Bayesian  credibility  interval, 
and  then  stopping  when  that  Bayesian  interval  no  longer  included  a  particular 
point  is  perfectly  reasonable.  In  the  context  of  mastery  decision  making,  how¬ 
ever,  I  cannot  understand  why  a  two-sided  interval  was  computed.  It  makes  no 
sense  at  all  from  any  kind  of  Bayesian  logic.  Any  consideration  of  a  concept  of 
utility  or  loss  must  lead  to  a  one-sided  interval.  That  struck  me  as  being  the 
most  glaring  failure  to  bring  decision  theory  to  bear  on  what  is  being  done.  If 
pressed,  however,  I  could  find  a  half  a  dozen  more  examples;  and  that,  I  think, 
is  discouraging. 
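
One way to read the one-sided versus two-sided point is sketched below. The sketch assumes, purely for illustration, a normal posterior for ability with a given mean and standard deviation; the function names, the 95% level, and the numerical values are hypothetical and are not taken from the Kingsbury and Weiss paper.

```python
from statistics import NormalDist


def two_sided_excludes(mean, sd, cutoff, level=0.95):
    """True if the central (two-sided) credibility interval excludes the cutoff."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return abs(mean - cutoff) > z * sd


def one_sided_decision(mean, sd, cutoff, level=0.95):
    """Decide from the posterior probability that ability exceeds the cutoff."""
    p_mastery = 1 - NormalDist(mean, sd).cdf(cutoff)
    if p_mastery >= level:
        return "mastery"
    if p_mastery <= 1 - level:
        return "nonmastery"
    return "continue"


# Hypothetical posterior: mean 0.9, standard deviation 0.5, cutoff 0.0.
print(two_sided_excludes(0.9, 0.5, 0.0))    # interval still covers the cutoff
print(one_sided_decision(0.9, 0.5, 0.0))    # posterior P(mastery) is about .96
```

Under these assumptions the one-sided rule reaches a decision while the two-sided rule would keep testing, because the two-sided interval spends half of its error allowance on the tail that is irrelevant to the decision.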
Generally speaking, the fundamental role of the mathematical model in
psychology is to simulate psychological reality following some sound rationale,
using  well-defined  parameters.  The  normal  ogive  model  in  latent  trait  theory, 
for  example,  is  one  of  such  mathematical  models.  Another  role  of  the  mathemati¬ 
cal  model  may  be  to  provide  some  mathematical  convenience,  just  as  the  logistic 
model  does  in  its  relationship  with  the  normal  ogive  model. 

Mathematical  models  of  a  third  type,  which  have  their  specific  usefulness 
in  the  context  of  comprehensive  theories  and  methods,  can  be  conceived.  The 
direct  simulation  of  psychological  reality  is  less  important  for  this  type  of 
model  than  it  is  for  the  first  two  types  of  models.  The  Constant  Information 
Model belongs to this new type; its role is to be of assistance in the
developmental stages of theories and methods, rather than to simulate
psychological reality directly.

The  Constant  Information  Model  (Samejima,  1979)  is  a  new  model  on  the  di¬ 
chotomous  response  level  (Samejima,  1972).  This  model  provides  a  constant  item 
information  for  a  finite  interval  of  a  latent  trait.  Although  the  usefulness  of 
the  model  has  not  yet  been  fully  investigated,  an  effective  use  of  the  model  has 
been  found  in  the  process  of  estimating  the  operating  characteristics  of  item 
response categories (Samejima, 1977b, 1977d, 1978a, 1978b, 1978c, 1978d, 1978e,
1978f).

Let $\theta$ be ability, or latent trait, which is assumed to be unidimensional.
Let $g$ (= 1, 2, ..., $n$) be an item and $x_g$ (= 0, 1, 2, ..., $m_g$) denote an
item response category, or item score. The operating characteristic,
$P_{x_g}(\theta)$, of the item score $x_g$ is defined by

$$ P_{x_g}(\theta) = \mathrm{Prob}[x_g \mid \theta] . \qquad [1] $$

Let $V$ be a response pattern, such that

$$ V = (x_1, x_2, \ldots, x_g, \ldots, x_n) , \qquad [2] $$

and $P_V(\theta)$ be its operating characteristic. Because of the assumption of
local independence,

$$ P_V(\theta) = \prod_{x_g \in V} P_{x_g}(\theta) \qquad [3] $$

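can be written. The item response information function (Samejima, 1969),
$I_{x_g}(\theta)$, is defined by

$$ I_{x_g}(\theta) = -\frac{\partial^2}{\partial\theta^2} \log P_{x_g}(\theta) , \qquad [4] $$

and the item information function, $I_g(\theta)$, is the conditional expectation
of the item response information function, given $\theta$, so that

$$ I_g(\theta) = E[I_{x_g}(\theta) \mid \theta]
   = \sum_{x_g=0}^{m_g} I_{x_g}(\theta) P_{x_g}(\theta)
   = \sum_{x_g=0}^{m_g} \left[ \frac{\partial}{\partial\theta} P_{x_g}(\theta) \right]^2 [P_{x_g}(\theta)]^{-1} \qquad [5] $$
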

(cf. Samejima, 1969, chap. 6). The response pattern information function,
$I_V(\theta)$, can be written for a specified response pattern $V$ such that

$$ I_V(\theta) = -\frac{\partial^2}{\partial\theta^2} \log P_V(\theta)
   = \sum_{x_g \in V} I_{x_g}(\theta) , \qquad [6] $$

and the test information function, $I(\theta)$, is defined as the conditional
expectation, given $\theta$, of the response pattern information function, which
can be written as

$$ I(\theta) = \sum_{V} I_V(\theta) P_V(\theta) = \sum_{g=1}^{n} I_g(\theta) . \qquad [7] $$


When item $g$ is binary (i.e., is scored either 0 or 1), $u_g$ is used for the
item score category instead of $x_g$. The item characteristic function,
$P_g(\theta)$, is defined by

$$ P_g(\theta) = \mathrm{Prob}[u_g = 1 \mid \theta] , \qquad [8] $$

and the other operating characteristic for $u_g = 0$, which is denoted by
$Q_g(\theta)$, can be written as

$$ Q_g(\theta) = \mathrm{Prob}[u_g = 0 \mid \theta] = 1 - P_g(\theta) . \qquad [9] $$

From Equations 5, 8, and 9,

$$ I_g(\theta) = \left[ \frac{\partial}{\partial\theta} P_g(\theta) \right]^2 [P_g(\theta) Q_g(\theta)]^{-1} , \qquad [10] $$

which is identical with the item information function defined by Birnbaum
(1968), can be obtained for the item information function.
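
Equation 10 can be evaluated numerically for any binary item characteristic function. The following is a minimal sketch that approximates the derivative by a central difference; the function names are hypothetical, and the logistic curve (written without a scaling constant) is used only as a familiar test case, for which Equation 10 reduces to $a^2 P Q$.

```python
import math


def item_information(icf, theta, h=1e-5):
    """Equation 10: [dP/dtheta]^2 / [P * Q], with a central-difference derivative."""
    p = icf(theta)
    dp = (icf(theta + h) - icf(theta - h)) / (2 * h)
    return dp * dp / (p * (1 - p))


def logistic_icf(a, b):
    """Logistic item characteristic function, no scaling constant (assumption)."""
    return lambda theta: 1 / (1 + math.exp(-a * (theta - b)))


icf = logistic_icf(a=1.2, b=0.3)
info_at_b = item_information(icf, 0.3)
print(info_at_b)     # close to a^2 / 4 at theta = b
```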

The  Constant  Information  Model 


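The item characteristic function of the new model, the Constant Information
Model (CIM), is given by

$$ P_g(\theta) = \begin{cases}
   \sin^2 [a_g(\theta - b_g) + (\pi/4)] , & \text{for } \underline{\theta}_g < \theta < \overline{\theta}_g \\
   0 & \text{otherwise,}
   \end{cases} \qquad [11] $$

where

$$ \underline{\theta}_g = [-\pi a_g^{-1}/4] + b_g , \qquad
   \overline{\theta}_g = [\pi a_g^{-1}/4] + b_g . \qquad [12] $$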
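
A minimal sketch of Equations 11 and 12 follows. The function names are hypothetical, and the sketch deliberately refuses to evaluate the curve outside the open interval of Equation 12 rather than assigning a value there.

```python
import math


def cim_bounds(a, b):
    """Lower and upper limits of theta from Equation 12."""
    half_width = math.pi / (4 * a)
    return b - half_width, b + half_width


def cim_icf(theta, a, b):
    """Item characteristic function of Equation 11 inside its interval."""
    lower, upper = cim_bounds(a, b)
    if not (lower < theta < upper):
        raise ValueError("theta lies outside the interval of Equation 12")
    return math.sin(a * (theta - b) + math.pi / 4) ** 2


print(cim_bounds(0.25, 0.0))        # approximately (-3.1416, 3.1416)
print(cim_icf(0.0, 0.25, 0.0))      # 0.5 at theta = b
```

At $\theta = b_g$ the argument of the sine is $\pi/4$, so the probability of a correct response is .5, which is why $b_g$ plays the role of a difficulty parameter.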


Figure 1

Item Characteristic Functions of Five Binary Items
Following the Constant Information Model

(Horizontal axis: latent trait $\theta$)

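Figure 1 presents five examples of the item characteristic functions in the
CIM with varieties of sets of parameters. From Equations 9 and 11, $Q_g(\theta)$
can be written for the other operating characteristic such that

$$ Q_g(\theta) = \cos^2 [a_g(\theta - b_g) + (\pi/4)] , \qquad [13] $$

and the first derivative of the item characteristic function is

$$ \frac{\partial}{\partial\theta} P_g(\theta)
   = 2a_g \sin[a_g(\theta - b_g) + (\pi/4)] \cos[a_g(\theta - b_g) + (\pi/4)]
   = 2a_g [P_g(\theta) Q_g(\theta)]^{1/2} . \qquad [14] $$

For the interval of $\theta$ such that

$$ [-\pi a_g^{-1}/4] + b_g < \theta < [\pi a_g^{-1}/4] + b_g , \qquad [15] $$

it is obvious that in this model:
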

(A) the item characteristic function is strictly increasing in $\theta$ in the
above interval of $\theta$,

and

(B)

$$ \lim_{\theta \to \underline{\theta}_g} P_g(\theta) = 0 , \qquad
   \lim_{\theta \to \overline{\theta}_g} P_g(\theta) = 1 . \qquad [16] $$

For convenience, hereafter, the CIM will be considered only for the range of
$\theta$ given by Equation 15, unless otherwise stated. The mathematical models
which satisfy (A) and (B) will be called models of Type I.

It is obvious from Equations 11, 13, and 14 that the model provides constant
item information such that

$$ I_g(\theta) = 4a_g^2 = C_g , \qquad [17] $$

where $C_g$ indicates the constant amount of information. Figure 2 presents the
item information functions for the five items whose item characteristic
functions are given in Figure 1. The length of the interval of $\theta$ for which
the item information function equals $C_g$ is given by

$$ \overline{\theta}_g - \underline{\theta}_g = \pi C_g^{-1/2} = \pi / [2a_g] . \qquad [18] $$
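
The constancy claimed in Equation 17 can be checked numerically. The sketch below evaluates Equation 10 for the sine-squared curve of Equation 11 with a central-difference derivative; the function names and the chosen evaluation points are assumptions made for the example.

```python
import math


def cim_item_information(theta, a, b, h=1e-6):
    """Item information (Equation 10) for the CIM curve of Equation 11."""
    def p(x):
        return math.sin(a * (x - b) + math.pi / 4) ** 2
    dp = (p(theta + h) - p(theta - h)) / (2 * h)
    return dp ** 2 / (p(theta) * (1 - p(theta)))


def constant_interval_length(a):
    """Length of the interval on which the information is constant (Equation 18)."""
    return math.pi / (2 * a)


# Item with a = 0.25 and b = 0.00: information should stay at 4 * a^2 = 0.25.
values = [cim_item_information(theta, 0.25, 0.0) for theta in (-2.0, 0.0, 2.0)]
print(values)
print(constant_interval_length(0.25))   # 2 * pi, i.e. from -pi to +pi
```

A larger discrimination parameter therefore buys more information per item at the price of a narrower interval over which the item is informative.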


Figure 2

Item Information Functions of Five Binary Items
Following the Constant Information Model

(Legend: $a_1$ = 0.25 and $b_1$ = 0.00; $a_2$ = 0.50 and $b_2$ = 0.50;
$a_3$ = 0.75 and $b_3$ = 2.00; $a_4$ = 1.00 and $b_4$ = -1.50;
$a_5$ = 2.00 and $b_5$ = 0.50. Horizontal axis: latent trait $\theta$, with
interval endpoints marked at -3.1416, -2.2854, -1.0708, -0.7146, 0.1073,
0.8927, 2.0708, and 3.0472.)
The basic function, $A_{x_g}(\theta)$ (Samejima, 1969, 1972), which is defined by

$$ A_{x_g}(\theta) = \frac{\partial}{\partial\theta} \log P_{x_g}(\theta) \qquad [19] $$

for the item response category $x_g$, is obtained for the CIM from Equations 11,
13, 14, and 19 such that

$$ A_{u_g}(\theta) = \begin{cases}
   -2a_g [P_g(\theta)]^{1/2} [Q_g(\theta)]^{-1/2}
     = -2a_g \tan[a_g(\theta - b_g) + \pi/4] & \text{for } u_g = 0 \\
   2a_g [Q_g(\theta)]^{1/2} [P_g(\theta)]^{-1/2}
     = 2a_g \cot[a_g(\theta - b_g) + \pi/4] & \text{for } u_g = 1 .
   \end{cases} \qquad [20] $$


Figure 3 presents these basic functions for the five items shown earlier. It is
clear from Equation 20 that the basic function for $u_g = 1$ is a strictly
decreasing function of $\theta$ with positive infinity and zero as its two
asymptotes and that for $u_g = 0$ it is a strictly decreasing function of
$\theta$ with zero and negative infinity as its two asymptotes, respectively.
This is also confirmed visually by Figure 3.

Figure 3

Basic Functions of Five Items Following the Constant Information Model
for $u_g = 0$ (Upper Graph) and for $u_g = 1$ (Lower Graph)


The item response information functions for each of the binary scores,
$u_g = 0$ and $u_g = 1$, are given from Equations 4, 11, 13, and 14 by

$$ I_{u_g}(\theta) = \begin{cases}
   2a_g^2 \sec^2[a_g(\theta - b_g) + \pi/4]
     = 2a_g^2 [Q_g(\theta)]^{-1} > 0 & \text{for } u_g = 0 \\
   2a_g^2 \csc^2[a_g(\theta - b_g) + \pi/4]
     = 2a_g^2 [P_g(\theta)]^{-1} > 0 & \text{for } u_g = 1 .
   \end{cases} \qquad [21] $$
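
As a consistency check, added here as a worked step rather than taken from the original derivation, the conditional expectation prescribed by Equation 5 can be applied to the two branches of Equation 21. Weighting each branch by the probability of its response recovers the constant of Equation 17:

```latex
\begin{align*}
I_g(\theta)
  &= I_{u_g = 0}(\theta)\, Q_g(\theta) + I_{u_g = 1}(\theta)\, P_g(\theta) \\
  &= 2a_g^2 [Q_g(\theta)]^{-1} Q_g(\theta) + 2a_g^2 [P_g(\theta)]^{-1} P_g(\theta) \\
  &= 4a_g^2 .
\end{align*}
```

Each response contributes a varying amount of information, but the less likely response is always the more informative one by exactly the factor needed to keep the expectation constant.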


Figure 4 illustrates these two functions for the item with parameters
$a_g = 0.25$ and $b_g = 0.00$, together with the constant item information
(= 0.25). The response pattern information function, $I_V(\theta)$, is written
from Equations 6 and 21 such that

$$ I_V(\theta) = 2 \sum_{u_g \in V} a_g^2 [P_g(\theta)]^{-u_g} [Q_g(\theta)]^{u_g - 1} , \qquad [22] $$

and the test information function, $I(\theta)$, is given from Equations 7 and 17 by

$$ I(\theta) = 4 \sum_{g=1}^{n} a_g^2 . \qquad [23] $$

Figure 5 presents both the set of four response pattern information functions
and the test information function for a hypothetical test of two binary items,
whose item parameters are $a_1 = 0.25$, $b_1 = 0.00$, $a_2 = 0.50$, and
$b_2 = 1.00$, each of which follows the CIM.


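It should be noted that the interval of $\theta$ for which the item information
is a positive constant is always finite. This is related to the fact that

$$ \int_{\underline{\theta}_g}^{\overline{\theta}_g} [I_g(\theta)]^{1/2} \, d\theta = \pi \qquad [24] $$

for any model of Type I, which includes the CIM (Samejima, 1979). Thus,

$$ C_g^{1/2} (\overline{\theta}_g - \underline{\theta}_g) = \pi , \qquad [25] $$

and $(\overline{\theta}_g - \underline{\theta}_g)$ must therefore be a finite value.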
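
The value $\pi$ in Equation 24 can be seen from a change of variable. The following is a sketch of one possible argument, using Equation 10 together with properties (A) and (B) of a Type I model; it is not reproduced from Samejima (1979).

```latex
\begin{align*}
\int_{\underline{\theta}_g}^{\overline{\theta}_g} [I_g(\theta)]^{1/2}\, d\theta
  &= \int_{\underline{\theta}_g}^{\overline{\theta}_g}
     \frac{\partial P_g(\theta) / \partial\theta}{[P_g(\theta)\, Q_g(\theta)]^{1/2}}\, d\theta
     && \text{(Equation 10, with } \partial P_g / \partial\theta > 0 \text{ by (A))} \\
  &= \int_{0}^{1} \frac{dp}{[p(1 - p)]^{1/2}}
     && \text{(substituting } p = P_g(\theta) \text{, limits from (B))} \\
  &= \Bigl[\, 2 \sin^{-1} p^{1/2} \Bigr]_{0}^{1} = \pi .
\end{align*}
```

When the integrand is the constant $C_g^{1/2}$, the integral is simply $C_g^{1/2}$ times the interval length, which gives Equation 25.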

Figure 4

Item Response Information Functions of an Item Following
the Constant Information Model, with the Parameters
$a_g$ = .25 and $b_g$ = 0.0, for $u_g = 0$ and for $u_g = 1$,
Together with the Constant Item Information Function

(Horizontal axis: latent trait $\theta$)


Use  of  the  Constant  Information  Model  in  the  Estimation  of  the 
Operating  Characteristics  of  Item  Response  Categories 

The methods and approaches for estimating the operating characteristics of
the item response categories of a test item developed so far (Samejima, 1977b,
1977d, 1978a, 1978b, 1978c, 1978d, 1978e, 1978f) have common characteristics
such that (1) they are made without assuming any prior mathematical form and (2)
they use a relatively small number of examinees in the process of estimation.
One common restriction in these methods and approaches is that what is needed is
the Old Test, consisting of the items whose operating characteristics are known,
which provides a constant test information function for the interval of latent
trait, or ability, $\theta$, of interest. With this setting, each examinee's
ability level is estimated from his/her response pattern by the maximum
likelihood estimation; and the set of these maximum likelihood estimates is the
main information source of the estimation procedures.
Figure 5

Response Pattern Information Functions of the Four Possible
Response Patterns of Two Binary Items Following the
Constant Information Model with the Parameters
$a_1$ = .25, $b_1$ = 0.0, $a_2$ = .50, and $b_2$ = 1.0


It  will  be  noted  that  these  methods  and  approaches  are  useful  in  a  situation 
when  there  already  is  an  item  pool  and  it  is  desired  to  add  more  test  items  to 
it.  In  a  different  setting  where  the  starting  point  is  the  very  beginning  of 
item  pool  development,  however,  these  methods  appear  to  be  useless,  since  there 
is  no  Old  Test,  and,  consequently,  the  maximum  likelihood  estimate  of  each  exam¬ 
inee's  ability  is  not  a  priori  given. 

In  the  above  setting,  it  is  very  common  that  a  large  number  of  test  items, 
which  include  many  items  of  intermediate  difficulties,  is  administered  to  the 
group  of  examinees  chosen  for  the  purpose  of  developing  the  item  pool.  If, 
among  the  items  administered,  a  substantially  large  subset  of  binary  items  can 
be found which can be regarded as equivalent items, then that subset of items
can  be  used  in  place  of  the  Old  Test.  Even  though  their  item  characteristic 
functions  are  not  known,  this  can  be  done  with  the  aid  of  the  CIM.  In  other 
words,  it  is  possible  to  expand  the  estimation  methods  developed  so  far  to  make 
them  usable  when  there  is  no  Old  Test. 

Estimation  Procedures 


In  practice,  it  is  necessary  to  identify  these  equivalent  items  without 
knowing  their  operating  characteristics.  This  can  be  done  as  follows:  First  of 
all,  the  proportion  correct  must  be  obtained  for  each  item.  If  a  subset  of 
items  exists  whose  proportions  correct  are  around  .5  and  close  enough  to  one 
another,  then  these  items  make  a  good  candidate  for  the  subset  of  equivalent 
items.  Second,  a  2  x  2  contingency  table,  as  exemplified  in  Table  1,  must  be 
made  for  each  pair  of  items  in  the  subset.  In  order  for  the  items  to  be  equiva¬ 
lent,  it  is  necessary  that  within  the  range  of  sampling  fluctuations,  these  2  x 
2  contingency  tables  of  the  bivariate  frequency  distributions  should  be  symmet¬ 
ric  and  identical  for  all  the  pairs  of  binary  items.  This  can  be  checked  for 
every pair of binary items that have passed the first screening, and, very
likely, more items must be eliminated through this second screening. Now there
can be progression to the third screening of $2^3$ contingency tables, to the
fourth screening of $2^4$ contingency tables, and so forth. Note, however, that
an increasingly large number of more complicated contingency tables is usually
encountered as progression is made to higher stages of screening. Some criterion
must therefore be set for terminating this process, and the remaining items must
then be accepted by assuming their equivalence. Table 1 illustrates two typical
2 x 2 contingency tables: one is for a pair of equivalent binary items that has a
common low discrimination parameter and the other is for a pair that has a
common high discrimination parameter.


Table 1

Two Typical 2 x 2 Contingency Tables
for a Pair of Equivalent Items with a
Common Low Discrimination Parameter
and with a Common High Discrimination Parameter

                               Item h
Item g            u_h = 0     u_h = 1      Total

Low Discrimination Parameter
  u_g = 0            110         243         353
  u_g = 1            248         399         647
  Total              358         642        1000

High Discrimination Parameter
  u_g = 0            300          53         353
  u_g = 1             58         589         647
  Total              358         642        1000
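
The second screening can be sketched in code. The text only requires that each 2 x 2 table be symmetric "within the range of sampling fluctuations"; the statistic used below, a McNemar-type chi-square on the two off-diagonal cells with one degree of freedom, is an assumption made for this illustration and is not named in the source. The frequencies are those of Table 1.

```python
def symmetry_statistic(table):
    """McNemar-type statistic for a 2 x 2 table [[n00, n01], [n10, n11]].

    Rows index u_g and columns index u_h. Under symmetry the two
    off-diagonal frequencies should differ only by sampling error.
    """
    n01, n10 = table[0][1], table[1][0]
    if n01 + n10 == 0:
        return 0.0
    return (n01 - n10) ** 2 / (n01 + n10)


CHI2_1DF_05 = 3.841    # 5% critical value of chi-square with 1 df

low_discrimination = [[110, 243], [248, 399]]
high_discrimination = [[300, 53], [58, 589]]

for name, table in (("low", low_discrimination), ("high", high_discrimination)):
    stat = symmetry_statistic(table)
    print(name, round(stat, 4), "symmetric" if stat < CHI2_1DF_05 else "asymmetric")
```

Both tables pass this check. The requirement that the tables also be identical across all pairs of items would need a further comparison between tables, which is not shown here.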

After the subset of equivalent binary items has been identified, then the
CIM is assumed for each equivalent item. This assumption is made without loss
of generality, since the scale of the latent trait is subject to any strictly
increasing transformation (Samejima, 1969, 1979).

The next step is to obtain the maximum likelihood estimate, $\hat{\theta}$, of
$\theta$ on the response pattern of each of the $N$ examinees, with respect to
the above subset of equivalent items. Let $V^*$ be the examinee's response
pattern on the subset of $k$ equivalent items. Since they are equivalent items,
the simple test score, $t$, such that

$$ t = \sum_{u_g \in V^*} u_g \qquad [26] $$

is a minimal sufficient statistic, regardless of the model that these item
characteristic functions follow (cf. Birnbaum, 1968). Thus, the procedure of
maximum likelihood estimation becomes much simpler, using the test score $t$
instead of the response pattern $V^*$. In general, for any model on the
dichotomous response level

$$ P_{V^*}(\theta) = \prod_{u_g \in V^*} [P_g(\theta)]^{u_g} [Q_g(\theta)]^{1 - u_g} . \qquad [27] $$


Since this operating characteristic of the response pattern $V^*$ is itself the
likelihood function in estimating the examinee's ability, the symbol
$L_{V^*}(\theta)$ will be used for this function in the present section. When
all the items are equivalent, Equation 27 can be rewritten in the form

$$L_{V^*}(\theta) = [P_g(\theta)]^{t} [Q_g(\theta)]^{k-t} . \qquad [28]$$

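Thus, for the likelihood equation

$$\frac{\partial}{\partial\theta} \log L_{V^*}(\theta) = \left[\frac{\partial}{\partial\theta} P_g(\theta)\right] [t - kP_g(\theta)] [P_g(\theta) Q_g(\theta)]^{-1} = 0 , \qquad [29]$$

and the equation

$$t = kP_g(\theta) \qquad [30]$$

is obtained. For the maximum likelihood estimate $\hat{\theta}$,

$$\hat{\theta} = P_g^{-1}(t/k) . \qquad [31]$$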


Now, Equation 11 must be substituted into Equation 31, which results in

$$\hat{\theta} = a_g^{-1} [\sin^{-1} (t/k)^{1/2} - \pi/4] + b_g . \qquad [32]$$

It is obvious from Equations 15 and 32 that the range of $\hat{\theta}$ is
given by

$$[-\pi a_g^{-1}/4] + b_g \le \hat{\theta} \le [\pi a_g^{-1}/4] + b_g . \qquad [33]$$

Thus, the maximum likelihood estimate of each examinee is obtained, based
upon his/her test score for the subset of k equivalent items. This set of N
maximum likelihood estimates can be used in place of the one obtained on the Old
Test, and the process of the operating characteristic estimation in any
combination of a method and an approach can be followed. The error variance,
$\sigma^2$, is given by

$$\sigma^2 = [kC]^{-1} = [4ka_g^2]^{-1} ,$$

where

$$C_1 = C_2 = \cdots = C_k = C$$

and

$$a_1 = a_2 = \cdots = a_k = a_g .$$
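The closed-form estimate of Equation 32 and the error variance above are easy to evaluate numerically. The following is only a minimal Python sketch: the function names `cim_mle` and `cim_error_sd` and the sample values ($a_g$ = 0.25, $b_g$ = 0, k = 20) are illustrative assumptions, not part of the original paper.

```python
import math

def cim_mle(t, k, a_g, b_g):
    """Estimate theta from test score t on k equivalent CIM items (Equation 32)."""
    if not 0 <= t <= k:
        raise ValueError("test score t must lie between 0 and k")
    return (math.asin(math.sqrt(t / k)) - math.pi / 4) / a_g + b_g

def cim_error_sd(k, a_g):
    """Square root of the error variance [4 k a_g^2]^(-1)."""
    return 1.0 / math.sqrt(4 * k * a_g ** 2)

# Illustrative values (assumed here): a_g = 0.25, b_g = 0.0, k = 20 items.
a_g, b_g, k = 0.25, 0.0, 20
for t in (0, 10, 20):
    # t = 0 and t = k give the two endpoints of the range in Equation 33.
    print(t, round(cim_mle(t, k, a_g, b_g), 4))
print("standard error:", round(cim_error_sd(k, a_g), 4))
```

With these assumed values the extreme scores t = 0 and t = k map to the endpoints of the interval in Equation 33, and a score of exactly half the items maps to $b_g$.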


After the process of estimation has been completed, the latent trait $\theta$
can be transformed to another latent trait $\tau$ by any strictly increasing
function, $\tau(\theta)$. To give some examples, if $\theta$ is transformed to
$\tau$ by

$$\tau = P_g^{*-1}\left([\sin\{a_g(\theta - b_g) + (\pi/4)\}]^2\right) , \qquad [37]$$

where

$$P_g^*(\tau) = (2\pi)^{-1/2} \int_{-\infty}^{a_g^*(\tau - b_g^*)} \exp[-t^2/2] \, dt , \qquad [38]$$

then all the equivalent items will follow the normal ogive model specified by
Equation 38, and

$$\underline{\theta} = -\infty , \qquad \overline{\theta} = \infty ; \qquad [39]$$


if the transformation is such that

$$\tau = (\beta_g - \alpha_g) \sin^2[a_g(\theta - b_g) + (\pi/4)] + \alpha_g , \qquad [40]$$

then they will follow the linear model, whose item characteristic function is
given by

$$P_g^*(\tau) = (\tau - \alpha_g)(\beta_g - \alpha_g)^{-1} \qquad [41]$$

with

[42]

and if $\theta$ is transformed to $\tau$ by

$$\tau = [1/(Da_g^*)] \log[\tan^2\{a_g(\theta - b_g) + (\pi/4)\}] + b_g^* , \qquad [43]$$

then they will follow the logistic model, which is characterized by

$$P_g^*(\tau) = [1 + \exp\{-Da_g^*(\tau - b_g^*)\}]^{-1} \qquad [44]$$

with the $\underline{\theta}$ and $\overline{\theta}$ given in Equation 39.
Similar transformations can be made to change the item characteristic functions
of k equivalent items from those of the CIM to those of any other models of
Type I. In each case, the newly estimated operating characteristics of the
other items will be transformed according to the specific transformation of the
latent trait $\theta$ to $\tau$.

Some  Necessary  Considerations 


In using the generalized method of the operating characteristic estimation,
which was described in the preceding section, a few problems must be considered.
First of all, the constant test information provided by the subset of equivalent
binary items with the CIM should be substantially large, so that the normal
approximation for the conditional distribution of $\hat{\theta}$, given
$\theta$, will be acceptable. On the other hand, a substantially wide range of
ability $\theta$, for which the test information is constant, is needed in order
to make the estimation of the operating characteristics of the other items in
the item pool meaningful. These two are opposing factors, as is obvious from
Equations 15 and 17: a larger common discrimination parameter raises the item
information but narrows the interval of $\theta$. The solution for this problem
is to use a substantially large number of equivalent binary items whose common
discrimination parameter is low.

Another problem is the effect of the range of $\theta$ on the speed of
convergence of the conditional distribution of $\hat{\theta}$, given $\theta$,
to the normal distribution, $N(\theta, \{kC\}^{-1/2})$. Since the range of
$\hat{\theta}$ is a finite interval, which is given by Equation 33, it should be
expected that the truncation of the conditional distribution makes the
convergence slow around the values of $\theta$ close to
$(-\pi a_g^{-1}/4) + b_g$ and $(\pi a_g^{-1}/4) + b_g$, as is illustrated in
Figure 6. A solution for this problem is again to use a subset of equivalent
binary items whose common discrimination parameter is low, so that the range of
$\theta$ is wide enough to include all the examinees sufficiently inside of the
two endpoints of the interval of $\theta$. An alternative for the above
solution is to exclude examinees whose $\hat{\theta}$'s are close to
$(-\pi a_g^{-1}/4) + b_g$ or $(\pi a_g^{-1}/4) + b_g$. In the second solution,
however, the number of examinees usable for the estimation will be decreased,
and this may substantially affect the accuracy of the estimation of the
operating characteristics. It is worth noting that the solution for the first
problem also seems to be the best solution for the second problem.




Figure 6

Graphical Illustration of the Conditional Density
Functions of the Maximum Likelihood Estimate $\hat{\theta}$,
Given the Latent Trait $\theta$


Method. To investigate the speed with which the conditional distribution of
the maximum likelihood estimate, $\hat{\theta}$, approaches normality,
$N(\theta, I(\theta)^{-1/2})$, a Monte Carlo study was done. One hundred
hypothetical examinees were assumed to exist at each of eight different levels
of ability $\theta$, i.e., -3.0, -2.2, -1.4, -0.6, 0.2, 1.0, 1.8, and 2.6. It
was assumed that each examinee had taken 20 sessions of equivalent tests, in
each of which 10 equivalent binary items were given. Each binary item was
assumed to follow the CIM with the parameters $a_g$ = .25 and $b_g$ = 0.00, and
the response pattern was calibrated by the Monte Carlo method for each examinee
in each of the 20 hypothetical sessions of testing.

The cumulative test score was calculated after each session by summing all the
binary item scores that the examinee had obtained. In this way, the number of
the binary items used for the computation of the cumulative test score was 10
after the first session, 20 after the second session, 30 after the third
session, and 200 after the completion of the twentieth session. The maximum
likelihood estimate of ability was obtained for each of the 800 examinees, based
upon each cumulative test score following the method described in the preceding
section.
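The design described above can be imitated in a few lines of Python. This is a hedged sketch rather than a reproduction of the original study: the random number generator, the seed, and the function names `cim_prob`, `cim_mle`, and `simulate` are assumptions made for illustration, and the numbers it prints will not match the published figures exactly.

```python
import math
import random
import statistics

def cim_prob(theta, a_g=0.25, b_g=0.0):
    """CIM item characteristic function: sin^2[a_g(theta - b_g) + pi/4]."""
    return math.sin(a_g * (theta - b_g) + math.pi / 4) ** 2

def cim_mle(t, k, a_g=0.25, b_g=0.0):
    """Maximum likelihood estimate from cumulative score t on k equivalent items."""
    return (math.asin(math.sqrt(t / k)) - math.pi / 4) / a_g + b_g

def simulate(theta, n_examinees=100, sessions=20, items_per_session=10, seed=0):
    """Return {number of items so far: list of estimates} after each session."""
    rng = random.Random(seed)
    p = cim_prob(theta)
    scores = [0] * n_examinees
    results = {}
    for session in range(1, sessions + 1):
        for i in range(n_examinees):
            # Each item is answered correctly with probability p.
            scores[i] += sum(rng.random() < p for _ in range(items_per_session))
        k = session * items_per_session
        results[k] = [cim_mle(t, k) for t in scores]
    return results

for theta in (-2.2, 0.2):
    estimates = simulate(theta)
    for k in (10, 20, 50, 100, 200):
        mean = statistics.mean(estimates[k])
        sd = statistics.stdev(estimates[k])
        # Item information is 4 * a_g^2 = 0.25, so I(theta) = 0.25 * k.
        asymptotic_sd = (0.25 * k) ** -0.5
        print(theta, k, round(mean, 3), round(sd, 3), round(asymptotic_sd, 3))
```

Comparing the observed mean and standard deviation with $\theta$ and $I(\theta)^{-1/2}$ at each test length gives a rough numerical counterpart to the graphical comparison in the figures.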

Figure 7

Cumulative Frequency Distributions of the 100 Maximum Likelihood Estimates
for $\theta$ = -2.2 for Group 2, Based on 10, 20, 50, 100, and 200 Items

Results. Figure 7 presents the resultant cumulative frequency distribution
of the maximum likelihood estimates of the 100 examinees whose ability level was
-2.2 (i.e., relatively close to the lower endpoint, $-\pi$, of the interval for
which the item information, $I_g(\theta)$, assumes the positive value, .25),
after the completion of each of Sessions 1, 2, 5, 10, and 20. In each graph,
the normal distribution functions are also presented with $\theta$ (= -2.2) and
$I(\theta)^{-1/2}$ as the two parameters and with the mean and the standard
deviation of the observed 100 maximum likelihood estimates. As was expected,
the cumulative frequency distribution shows a substantial skewness to the
positive side when the number of binary items is as small as 10; and therefore
the normal distribution function with the two empirically obtained parameters
provides a curve that is located further to the right side of
$N(\theta, I(\theta)^{-1/2})$. This tendency still holds, though slightly, even
when the number of binary items is as large as 100.

For the purpose of comparison, Figure 8 presents a similar set of graphs for
another group of 100 hypothetical examinees whose ability levels were uniformly
0.2 (i.e., a value which is far from the two endpoints, $-\pi$ and $\pi$, of the
interval for which the item information function of each equivalent binary item
assumes the positive constant). The results illustrated in Figure 8 contrast
sharply with those in Figure 7: in Figure 8 the two normal distribution curves
almost overlap each other, even when the number of items used for obtaining the
maximum likelihood estimate is as small as 20. This indicates a much faster
convergence of the conditional distribution of the maximum likelihood estimate
to normality with $\theta$ and $I(\theta)^{-1/2}$ as the two parameters, in
comparison with the case in which the ability level is closer to one of the two
endpoints of the interval, $(-\pi, \pi)$.


If the latent trait $\theta$ is transformed to $\tau$ through Equations 38 and
39 so that each equivalent binary item follows the normal ogive model with
$a_g^*$ = 1.00 and $b_g^*$ = 0.00, then the values of $\tau$ corresponding to
$\theta$ = -2.2 and $\theta$ = 0.2 are, approximately, -1.60 and .13. If
Equation 44 is used for the transformation so that each equivalent binary item
follows the logistic model with the same parameters, $a_g^*$ and $b_g^*$, with
the scaling factor D = 1.7, then these corresponding values of $\tau$ are
approximately -1.68 and .12, respectively. A similar transformation by means of
Equation 41, which provides the linear model with the parameters $\alpha_g$ =
-2.5 and $\beta_g$ = 2.5 for each of the equivalent binary items, results in
-2.23 and .25 as the approximate values of $\tau$ corresponding to $\theta$ =
-2.2 and $\theta$ = 0.2, respectively.
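The three transformed values quoted above can be recomputed directly from the CIM probability $\sin^2[a_g(\theta - b_g) + \pi/4]$ with $a_g$ = .25 and $b_g$ = 0.00. The sketch below is illustrative only: the function names are invented here, and the use of Python's `statistics.NormalDist` for the inverse normal distribution function is an implementation choice, not something taken from the paper.

```python
import math
from statistics import NormalDist

A_G, B_G = 0.25, 0.0  # CIM parameters used in the Monte Carlo study

def cim_prob(theta):
    """Probability of a correct response under the CIM."""
    return math.sin(A_G * (theta - B_G) + math.pi / 4) ** 2

def to_normal_ogive(theta, a_star=1.0, b_star=0.0):
    """tau such that the normal ogive with (a*, b*) gives the same probability."""
    return NormalDist().inv_cdf(cim_prob(theta)) / a_star + b_star

def to_logistic(theta, a_star=1.0, b_star=0.0, d=1.7):
    """tau such that the logistic model with (a*, b*, D) gives the same probability."""
    p = cim_prob(theta)
    return math.log(p / (1 - p)) / (d * a_star) + b_star

def to_linear(theta, alpha=-2.5, beta=2.5):
    """tau such that the linear model on (alpha, beta) gives the same probability."""
    return (beta - alpha) * cim_prob(theta) + alpha

for theta in (-2.2, 0.2):
    print(theta,
          round(to_normal_ogive(theta), 2),
          round(to_logistic(theta), 2),
          round(to_linear(theta), 2))
```

Under these assumptions the printed values should agree, to about two decimal places, with the approximate values given in the text.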


Discussion  and  Conclusion 


A new model for a binary item, the Constant Information Model, has been
introduced, and its characteristics and usefulness have been discussed. As was
pointed out (Samejima, 1975, 1977a, 1977b, 1977c, 1977d, 1978a, 1978b, 1978c,
1978d, 1978e, 1978f), latent trait theory enlarges its horizon if full use is
made of information functions, enabling types of research to be conducted which
could not otherwise be done. For this reason, the Constant Information Model
will contribute to the productivity of research in the area of computerized
adaptive testing as well as in other areas, as exemplified in this paper.


Discussion: Session 4


Robert  Tsutakawa 
University  of  Missouri 


In reading Samejima's paper, one soon realizes that her ideas are quite
different and provocative: She appears to be knocking on the door of the
foundations of latent trait theory. In my discussion of her paper I will
attempt to provide appropriate motivation for the Constant Information Model
(CIM) and point out its relation to other statistical methods, reviewing some
of the important issues and raising points that could be discussed further.

The problem that motivates the CIM is not unique to mental testing.
Generally, in estimating a parameter $\theta$, the estimator will have a
variance depending on the unknown parameter $\theta$. When estimating several
parameters $\theta_1, \ldots, \theta_N$, say, the ability of N people, there
will be estimators with different variances. (An important exception to the
general property is the normal linear model where the constancy of variance
plays an important role.) If interest is in making statistical inferences based
on the estimated values, the constancy of variance will open up a variety of
statistical procedures, such as the analysis of variance. Moreover, in
designing an experiment the constancy of variance will permit the running of an
experiment with guaranteed precision.

Consider the item characteristic function of the CIM given by

$$P_g(\theta) = \sin^2[a_g(\theta - b_g) + \pi/4] , \qquad \underline{\theta}_g < \theta < \overline{\theta}_g . \qquad [1]$$

If its value is denoted by p, the inverse transformation is

$$\theta = a_g^{-1} [\sin^{-1} \sqrt{p} - \pi/4] + b_g , \qquad 0 < p < 1 . \qquad [2]$$

In this form it can be noted that this is essentially the arc sine
transformation, which was used in the 1930s and 1940s to stabilize the variance
of binomial proportions and to obtain a better normal approximation. The arc
sine transformation is also used by Bayesians for binomial samples to achieve
likelihood functions that are "data translated" and was used by Jeffreys to
obtain the noninformative prior for an unknown proportion. Samejima's
transformation should thus be of particular interest to Bayesians.
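The variance-stabilizing property of the arc sine transformation mentioned here can be checked numerically. The sketch below is an illustration added under stated assumptions: the sample size n = 200, the chosen proportions, and the helper names `binomial_pmf`, `variance_of`, and `arcsine` are all hypothetical. It computes exact variances under the binomial distribution and compares the raw proportion with its arc sine transform.

```python
import math

def binomial_pmf(n, x, p):
    """Probability of x successes in n Bernoulli(p) trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def variance_of(transform, n, p):
    """Exact variance of transform(X / n) when X is Binomial(n, p)."""
    values = [transform(x / n) for x in range(n + 1)]
    weights = [binomial_pmf(n, x, p) for x in range(n + 1)]
    mean = sum(v * w for v, w in zip(values, weights))
    return sum(w * (v - mean) ** 2 for v, w in zip(values, weights))

def arcsine(proportion):
    """Arc sine (angular) transformation of a proportion."""
    return math.asin(math.sqrt(proportion))

n = 200
for p in (0.3, 0.5, 0.7):
    raw = n * variance_of(lambda q: q, n, p)         # equals p(1 - p), varies with p
    stabilized = 4 * n * variance_of(arcsine, n, p)  # stays near 1 for these p
    print(p, round(raw, 4), round(stabilized, 4))
```

The scaled variance of the raw proportion changes with p, while that of the transformed proportion stays close to a constant, which is the property the CIM builds into the item characteristic function.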

Regarding the range of $\theta$ in the CIM, it is disconcerting to restrict
$\theta$ to intervals depending on the item. It seems more realistic to extend
the range to the whole real line by defining it as

$$P_g(\theta) = \begin{cases} 0 & \text{if } \theta \le \underline{\theta}_g \\ \text{as above} & \text{if } \underline{\theta}_g < \theta < \overline{\theta}_g \\ 1 & \text{if } \theta \ge \overline{\theta}_g . \end{cases} \qquad [3]$$

With this extension, different response functions can be dealt with, allowing
for the probability of a correct response on a given item, p, to be 1 whenever
the ability $\theta$ exceeds $\overline{\theta}_g$ and 0 whenever it is less
than $\underline{\theta}_g$. Note that when $\theta \le \underline{\theta}_g$
or $\theta \ge \overline{\theta}_g$, the formal information is 0. However, the
experimenter may have some idea about whether $\theta$ is very low or very high.

Given a fixed number of items n, equivalent tests are not only a convenience
but a practical necessity in order to attain constant total information over
all $\theta$ in the range of interest. With nonequivalent tests the total
information will generally (with rare exceptions) depend on $\theta$. This
raises the important question about how to find a set of equivalent items.

There are two simple ways of constructing tests so the items are equivalent.
One is to use the n items in random order. In this case the probability of a
correct response is constant for any ability $\theta$; the assumption of local
independence, however, is violated. Another method is to select the n items at
random from a large pool. In this case, the probability of a correct response
for ability $\theta$ is equal to the average probability of correct response
(averaged over the whole pool), and the assumption of local independence can be
defended. If the items are very different, however, Bayesians would consider
this poor practice.

Given a set of n nonrandom items, testing their equivalence is a challenging
statistical problem. Although Samejima has given some suggestions, considerably
more work is needed before these suggestions can be put into practice.

There are implications for the use of the Constant Information Model in
tailored testing. It seems reasonable to start with items with low $a_g$ at the
beginning when the location of $\theta$ is uncertain, and then use those with
higher $a_g$ as the region in which $\theta$ is likely to belong is narrowed.
With a good selection rule, this is likely to be more efficient than having a
large number of equivalent items with low $a_g$. The gain in efficiency will
more than offset the inconvenience of not having constant total information.


A Test of the Adequacy of Curvilinear Score
Equating Models


Gary  Marco,  Nancy  Petersen,  and  Elizabeth  Stewart 
Educational  Testing  Service 


In many common testing situations it is necessary to compare the test
scores of examinees who have taken different forms of a test. In practice, two
forms of a test cannot be expected to be of exactly equal difficulty for
examinees at all ability levels. Therefore, a comparison of raw scores on two
forms of a test will be unfair to the examinees who have taken the more
difficult form. Statistical procedures that have been developed to deal with
this problem are referred to as equating methods.

In an ideal psychometric world, tests on which scores need to be equated
would be parallel in all important respects: An anchor test, if used, would be
parallel to the total tests, and random samples on which to base the equating
would always be available. In actual testing practice, however, scores must
sometimes be equated under less than optimum conditions. This study is the
first part of a larger study, the purpose of which is to examine the adequacy of
score-equating models when certain sample and test characteristics are
systematically varied. The emphasis in this part of the study is on curvilinear
models, whereas the second part focuses on linear models. This study is more
comprehensive than previous studies of equating models (e.g., Levine, 1955;
Rentz & Bashaw, 1975; Slinde & Linn, 1977, 1978; Tucker, 1974) in that it
includes a greater variety of equipercentile, linear, and ICC models and
investigates equatings based on dissimilar samples as well as on random samples.


EQUATING  MODELS 

An equating method is an empirical procedure for determining a
transformation to be applied to the scores on one of two forms of a test. Its
purpose is, ideally, to transform the scores in such a way that it makes no
difference to the examinee which form of the test he or she takes. This ideal
can be reached only if (1) the two forms of the test measure exactly the same
latent trait (ability or skill) and yield scores that are equally reliable and
(2) the equating transformation is invertible.

Because  an  equating  method  is  an  empirical  procedure,  it  involves  a  design 
for  data  collection  and  a  rule  for  determining  the  transformation.  There  are 
three  basic  designs  for  data  collection  and  three  general  rules  for  determining 
the  transformation.  Any  of  the  three  designs  for  data  collection  can  be  used 
with  any  of  the  three  transformation  rules. 


Data Collection Designs


The three designs for data collection are the single-group method, the
equivalent group method, and the anchor test method (Lord, 1975). All of the
equating procedures used in this study assume that the data were collected using
the anchor test method. An anchor test design requires administering one form
of a total test to one group of examinees, a second form to a second group of
examinees, and a common anchor test to both groups. The anchor test can be
either internal or external to the tests to be equated. The anchor test is used
to reduce equating bias resulting from differences in ability between the two
groups.

Transformation  Rules 


The  three  general  rules  for  determining  the  transformation  are: 

1. Equipercentile equating. Choose a transformation such that scores from
   the two tests will be equated if they correspond to the same percentile
   rank in some group of examinees.

2. Linear equating. Choose a linear transformation such that scores from
   the two tests will be equated if they correspond to the same number of
   standard deviations from the mean in some group of examinees.

3. Item characteristic curve (ICC) equating. Choose a transformation such
   that true scores from the two tests will be equated if they correspond
   to the same estimated level of the latent trait underlying both tests.

All  three  types  of  equating  were  represented  in  this  study. 
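The first two rules can be sketched in Python for the simplest case. The sketch assumes that scores on both forms come from the same group of examinees (a single-group design); it does not implement the anchor test procedures used in this study, and it applies no smoothing or interpolation. The function names and the score lists are hypothetical.

```python
import statistics

def linear_equate(x, scores_x, scores_y):
    """Map x to the Y score lying the same number of SDs from the mean."""
    mean_x, mean_y = statistics.mean(scores_x), statistics.mean(scores_y)
    sd_x, sd_y = statistics.pstdev(scores_x), statistics.pstdev(scores_y)
    return mean_y + sd_y * (x - mean_x) / sd_x

def equipercentile_equate(x, scores_x, scores_y):
    """Map x to the Y score with the same percentile rank (no interpolation)."""
    count = sum(s <= x for s in scores_x)          # examinees at or below x on X
    ordered_y = sorted(scores_y)
    # Integer ceiling of count * len(Y) / len(X), converted to a 0-based index.
    index = max(-(-count * len(ordered_y) // len(scores_x)) - 1, 0)
    return ordered_y[index]

# Hypothetical raw scores; form Y is uniformly 5 points easier than form X.
scores_x = [10, 12, 15, 18, 20, 22, 25]
scores_y = [s + 5 for s in scores_x]
print(linear_equate(18, scores_x, scores_y))
print(equipercentile_equate(18, scores_x, scores_y))
```

When the two score distributions differ only by a shift, as in this made-up example, both rules recover the same conversion; they diverge when the shapes of the distributions differ.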

Equipercentile and linear methods using anchor test data can be further
classified as to whether the equating is done directly or indirectly by
frequency estimation. In direct equipercentile equating, scores on each test
and on the anchor test are first equated separately within each group. Then,
scores on the two tests to be equated are said to be equivalent if they
correspond to the same score on the anchor test. Frequency estimation (Angoff,
1971, pp. 581-582), on the other hand, makes use of the combined distribution of
scores on the anchor test. The score distributions of the tests to be equated
are estimated for the combined group of examinees, and these estimated
distributions are then used as if they had been observed from a single-group
design. The resulting marginal distributions on the two forms are used in the
case of equipercentile equating; the resulting estimated means and standard
deviations on the two forms are used in the case of linear equating.

Operationalizing  the  Models 

Many of the linear methods require error variance estimates. The three
methods of estimating error variances that were used in this study were Angoff's
(1953) method, which uses anchor test data; Feldt's method (1975), which uses
part-test data; and coefficient alpha, which uses item-response data.

One of the ICC procedures utilized the ICC parameters (item difficulties)
from the 1-parameter logistic test model; the other utilized the ICC parameters
from the 3-parameter logistic test model. The computer program LOGIST (Wood &
Lord, 1976; Wood, Wingersky, & Lord, 1976) was used to estimate item parameters
and examinee abilities. For both models, true formula scores (R-W/4) were
equated by calculating the true formula scores on each test form corresponding
to selected ability levels (Lord, 1975) and interpolating as necessary. Since
for either model there is a functional relationship between ability and true
score, the true formula score is readily computed by the equation

$$\text{true formula score} = R_\theta - (N - R_\theta)/(A - 1) , \qquad [1]$$

where

$R_\theta$ is the true number-correct score at ability $\theta$ computed by
summing the item proportions correct under the model,

N is the number of test items, and

A is the number of response options for the items in the test (five choices
in all cases).
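The true formula score computation can be sketched as follows. This is a minimal illustration, not the LOGIST procedure: the 3-parameter logistic form with scaling constant D = 1.7 is the conventional one and is assumed here, and the item parameters and function names are hypothetical.

```python
import math

def p_correct_3pl(theta, a, b, c, d=1.7):
    """Three-parameter logistic probability of a correct response (D = 1.7 assumed)."""
    return c + (1 - c) / (1 + math.exp(-d * a * (theta - b)))

def true_formula_score(theta, items, n_options=5):
    """R - (N - R)/(A - 1), where R sums the model probabilities at theta."""
    n_items = len(items)
    r = sum(p_correct_3pl(theta, a, b, c) for a, b, c in items)
    return r - (n_items - r) / (n_options - 1)

# Hypothetical item parameters (a, b, c); not taken from the study.
items = [(1.0, -1.0, 0.2), (0.8, 0.0, 0.2), (1.2, 1.0, 0.2)]
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(true_formula_score(theta, items), 3))
```

Evaluating such a function for two forms at the same ability levels pairs up their true formula scores, which is the sense in which the scores are equated. With a lower asymptote of 0.2 on five-choice items, the true formula score approaches 0 at very low ability and the number of items at very high ability.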

The various equating models used in this study are described briefly in
Table 1. The models are categorized by whether the procedure results in a
linear or curvilinear transformation between observed or true scores. The table
provides references and information as to the major assumption underlying each
model, the kind of data required, and whether specific error variance estimates
are needed. If the codes for the model differ for an external and an internal
anchor test, then the formulas for computing the transformation parameters
differ in the two cases. A total of 40 linear (including 4 based on the
marginal means and standard deviations resulting from frequency estimation), 2
equipercentile, and 2 ICC equating models were used in the study.


STUDY  DESIGN 

Computer files for two national administrations (April 1975 and November
1975) of the verbal portion of the College Board Scholastic Aptitude Test were
obtained from Educational Testing Service (ETS). Fourteen pairs of total test
scores were equated on the basis of the data from each of the two
administrations. Records of test scores, responses to test items, and responses
to a Student Descriptive Questionnaire (SDQ) were accessed to construct samples
having specified characteristics, to extract or to calculate selected scores for
a number of special purpose total tests and anchor tests, and to compute item
statistics and other data needed for the various equating models.

Equating  Design 


The combinations of total test and anchor test used in the various
equatings for each of the two administrations can be classified into five
categories, referred to here as test variations. (There were 11 test variations
used in the full study of which this study is a part.)

Equating a Test to Itself

In this part of the study (see Table 2), a single form of the operational
verbal portion of the SAT (SAT-V) was treated as if it represented two different
forms; that is, it was to be equated to itself. (This type of design was first
used by Levine, 1955.) This part of the study was designed to investigate the
effects of varying (1) the relative difficulty levels of the total test and the
anchor test and (2) the degree of similarity between the two samples on which
the equating operations were based.


Table 2

Design for Equating a Medium-Difficulty Test
to Itself through an Anchor Test of Similar Content

                                     Relation     SAT-Verbal Score Level
Test            Anchor Test          Between      New Form    Old Form
Variation   Location   Difficulty    Samples      Sample      Sample

1           External   Medium        Random       Middle      Middle
                                     Dissimilar   Middle      High

4           Internal   Medium        Random       Middle      Middle
                                     Dissimilar   Middle      High

5           Internal   Easy          Random       Middle      Middle
                                     Dissimilar   Middle      High

            Internal   Hard          Random       Middle      Middle
                                     Dissimilar   Middle      High

SAT-V was equated to itself through two anchor tests that, like the total
test, were of medium difficulty. One was external (Test Variation 1), and one
was internal (Test Variation 4). SAT-V was also equated to itself through two
internal anchor tests that differed from it in difficulty (Test Variation 5).
One of the internal anchor tests was easier than SAT-V; and the other, more
difficult.

For each administration a single pair of random samples and a single pair
of dissimilar samples were used for all equatings of SAT-V to itself. The
dissimilar samples, by virtue of the sample selection procedure, were expected
to be of middle and high verbal ability, respectively.

Equating  Tests  of  Different  Difficulty 

In this part of the study (see Table 3), three total tests were constructed
for each administration from a pool comprising the operational SAT-V items and
the items in a nonoperational section of verbal material. The purpose of this
part of the study was to examine the effects of varying (1) the relative
difficulty levels of the two total tests on which scores were to be equated and
(2) the degree of similarity between the two equating samples.

Table 3

Design for Equating Tests that Differed Only in Difficulty through
an Internal Anchor Test of Similar Content and Medium Difficulty

                                      Relation     SAT-Verbal Score Level
Test        Total Test Difficulty     Between      New Form    Old Form
Variation   New Form    Old Form      Samples      Sample      Sample

8           Easy        Medium        Random       Middle      Middle
                                      Dissimilar   Low         Middle

            Medium      Hard          Random       Middle      Middle
                                      Dissimilar   Middle      High

9           Easy        Hard          Random       Middle      Middle
                                      Dissimilar   Low         High

Pairs of total tests constructed to differ in difficulty were equated
through an internal anchor test of medium difficulty. The random and dissimilar
samples used in equating SAT-V to itself were used in these equatings also,
along with an additional sample expected to be of low verbal ability.

In Test Variation 8 an easy test was equated to a medium-difficulty test
and the medium-difficulty test was equated to a hard test. In Test Variation 9
the easy test was equated to the hard test. For equatings based on dissimilar
samples, data from the low-ability sample were used for the easy test; from the
middle-ability sample, for the medium-difficulty test; and from the high-ability
sample, for the hard test.


Tests 


Several scores calculated as part of the normal processing for SAT
administrations were included in the study. Included also were scores computed
on a number of tests constructed retrospectively solely for use in the study.
The general approach followed in constructing the special purpose tests entailed
(1) developing content and statistical specifications for all special purpose
tests, (2) identifying sets of items in accordance with those specifications,
and (3) identifying subsets of items on which separate scores were to be
obtained for use in calculating one of the three sets of reliability estimates
required for the equating analyses.

Test  content  was  specified  only  in  terms  of  distributions  of  item  types, 
although  more  detailed  specifications  are  followed  in  developing  operational 
forms  of  SAT-V.  SAT-V  is  composed  of  three  types  of  discrete  five-choice 
items — antonyms  (25  items),  analogies  (20  items),  and  sentence  completions  (15 
items) — and  of  five  reading  comprehension  passages,  each  of  which  is  followed  by 
5  five-choice  items  based  on  the  passage. 

Statistical specifications were stated in terms of the item statistics
conventionally used in the development of ETS tests (described by Angoff & Dyer,
1971, pp. 9-10). The equated delta ($\Delta_e$) served as the index of item
difficulty; and the biserial correlation ($r_b$) of the item score with the
total score on the operational test in which the item appeared, as the index of
item discrimination. The statistic $\Delta_e$ is an estimate of the difficulty
of the item for a standard reference group. It ranges in value from about 6
(very easy) to 18 (very hard). If a test composed of five-choice items were of
middle difficulty for the reference group, its mean $\Delta_e$ would be 12.0.
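The value of 12.0 for a middle-difficulty five-choice test can be checked with a
short sketch. It assumes the conventional definition of the delta index (a
normal deviate of the proportion correct rescaled to mean 13 and standard
deviation 4, with harder items receiving higher values); that formula is not
given in the text above, and the function name `delta_index` is hypothetical.
The sketch computes an observed delta only; the equating of deltas to a
standard reference group is not modeled.

```python
from statistics import NormalDist


def delta_index(p_correct: float) -> float:
    """Delta difficulty index, assuming the 13 - 4z convention.

    z is the standard normal deviate below which a proportion
    p_correct of the distribution lies, so easy items (high
    p_correct) receive low delta values.
    """
    if not 0.0 < p_correct < 1.0:
        raise ValueError("p_correct must lie strictly between 0 and 1")
    return 13.0 - 4.0 * NormalDist().inv_cdf(p_correct)


# Assumption: "middle difficulty" for a five-choice item is taken as the
# proportion correct halfway between chance (0.20) and a perfect score (1.00).
middle_p = (0.20 + 1.00) / 2
middle_delta = delta_index(middle_p)
```

Under these assumptions `middle_delta` falls within a few hundredths of 12.0,
which agrees with the figure quoted for five-choice items.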

The item statistics used to construct the tests were taken from the results
of item analyses routinely conducted for each new form of the SAT-V. The
analyses were based on systematic samples of approximately 1,700 to 2,000
examinees each. After all full-length total tests and anchor tests had been
identified, part tests for use in reliability estimation (Feldt, 1975) were
created. Each part test was parallel, except for length, to the full-length
test from which it was derived.

Tests Used in Equating a Test to Itself

For  each  administration,  scores  were  available  on  an  external  anchor  test 
(a  nonoperational  section  of  verbal  material)  similar  in  content  to  the  SAT-V. 
The  external  anchor  tests  each  contained  equal  numbers  of  items  of  the  four 
types  included  in  the  SAT-V.  The  difficulties  of  the  external  anchor  tests  were 
not subject to experimentation. The mean difficulties of the external anchor
tests were within about one-half a $\Delta_e$ point of the mean for SAT-V, but
both the standard deviations of the $\Delta_e$'s ($\sigma_\Delta$) and the mean
$r_b$'s tended to be somewhat lower for the anchor tests.

For each administration, three internal anchor tests for equating SAT-V to
itself were specially constructed for the study from the pool of 85 operational
items in each form. The internal anchor tests were constructed to be similar in
content to SAT-V but to vary with regard to each other in mean difficulty.
Internal anchor tests constructed from the April 1975 item pool each contained
10 antonym, 6 analogy, 8 sentence completion, and 10 reading comprehension
items. The numbers of items in these respective categories from the November
1975 item pool were 10, 8, 8, and 10. For the medium-difficulty anchor tests,
the mean $\Delta_e$, $\sigma_\Delta$, and mean $r_b$ were made to match the
corresponding values for SAT-V as closely as possible, given the prescribed
content distribution. The $\sigma_\Delta$'s for the easy and hard anchor tests
were, of necessity, smaller than those for SAT-V and the medium-difficulty
anchor test. The item summary statistics and identification codes for the total
tests and the anchor tests used in equating a test to itself are given in
Table 4.

Tests  Used  in  Equating  Different  Tests 

For each administration, an expanded item pool consisting of the 85 SAT-V
items and the 40 items in the verbal external anchor test was used for creating
three total tests. The tests within each set were systematically different in
average difficulty but were similar in content and equal in length. The total
tests constructed from the April 1975 and the November 1975 item pools each
contained 15 antonym, 13 analogy, 11 sentence completion, and 15 reading
comprehension items.

For each administration a 20-item internal anchor test of medium difficulty,
similar in content to SAT-V, was constructed for use in equating the
special-purpose total tests. The internal anchor tests constructed from the
April 1975 and the November 1975 item pools each contained 6 antonym, 5 analogy,
4 sentence completion, and 5 reading comprehension items. The item summary
statistics and identification codes for the total tests and the anchor tests
used in equating different tests are given in Table 4.

Samples 


Two  base  samples  were  used  for  the  study,  one  each  from  the  April  1975  and 
the  November  1975  Saturday  administrations  of  the  SAT-V.  The  April  base  sample 
(No.  32)  was  selected  from  those  candidates  who  took  Verbal  Equating  Test  FM; 
the  November  base  sample  (No.  44),  from  those  who  took  Verbal  Equating  Test  FG. 
(Six  base  samples  were  used  in  the  full  study.) 

Each base sample consisted of 4,731 cases, from which 5 subsamples of 1,577
cases each were created in two different ways. Two nonoverlapping subsamples
("random" samples) were selected by use of an IBM recursive random number
generator. Three nonoverlapping subsamples ("dissimilar" samples) were selected
by an algorithm designed to yield samples dissimilar in mean verbal ability. Two
variables from the SDQ, level of educational aspiration and amount of
high-school foreign language training, were used to select the dissimilar
samples. These variables were known from prior information to have a high
relationship with SAT-V scores.
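The selection of the two random subsamples can be illustrated with a minimal
sketch. It covers only the random case; the algorithm used for the dissimilar
samples is not described in enough detail to reproduce. The function name
`nonoverlapping_subsamples`, the seed, and the use of Python's generator in
place of the IBM generator are all assumptions made for illustration.

```python
import random


def nonoverlapping_subsamples(case_ids, n_subsamples, size, seed=0):
    """Draw disjoint random subsamples of equal size from case_ids.

    The cases are shuffled once and then cut into consecutive slices,
    so no case can appear in more than one subsample.
    """
    shuffled = list(case_ids)
    if n_subsamples * size > len(shuffled):
        raise ValueError("not enough cases for the requested subsamples")
    random.Random(seed).shuffle(shuffled)
    return [shuffled[k * size:(k + 1) * size] for k in range(n_subsamples)]


# Two random subsamples of 1,577 cases from a base sample of 4,731 cases.
random_a, random_b = nonoverlapping_subsamples(range(4731), 2, 1577)
```

Shuffling once and slicing is one simple way to guarantee that the subsamples
do not overlap; other sampling schemes would serve equally well.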

Thus, a total of 10 subsamples of 1,577 cases each, 5 for each of the 2
base samples, were used in the study. Tables 5 and 6 give the summary
statistics for the total tests and the anchor tests used in the various
equating analyses.


Evaluative  Procedures 


Discrepancy  Indices 


For each raw score $x$ there is a corresponding criterion score $t$ and an
estimated criterion score $t'$ derived from a specific equating model. The
smaller the difference $d$ between $t'$ and $t$, the smaller the equating error
and the more appropriate the equating model.

The standardized weighted mean square difference and squared bias were
selected as the most useful summary indices for evaluating the effectiveness of
the various models. The weighted mean square difference gives the greatest
weight to those values of $x$ that are most likely to occur and is consistent
with what is used to represent total error in the statistical literature. The
indices were standardized (expressed as a proportion of the criterion standard
deviation) so that results could be compared across equating situations as well
as across equating models.


Table 4

Item Summary Statistics and Identification Codes for Total Tests
and Anchor Tests Used in Equating at Two Administrations

Administration and                          Mean         SD of        Mean
Test Description          ID        n       $\Delta_e$   $\Delta_e$   $r_b$

Equating a Test to Itself

April 1975
  Total Test
    SAT-V                 FT2XX     85      11.40        3.38         .47
  Anchor Test
    External              FE2FM     40      11.14        3.06         .43
    Internal
      Easy                FE2DE     34       9.39        2.90         .49
      Medium              FE2DM     34      11.40        3.26         .49
      Hard                FE2DH     34      12.96        2.49         .47

November 1975
  Total Test
    SAT-V                 FT4XX     85      11.36        3.40         .48*
  Anchor Test
    External              FE4FG     40      12.01        2.95         .43
    Internal
      Easy                FE4DE     36       9.40        2.68         .51
      Medium              FE4DM     36      11.44        3.43         .49
      Hard                FE4DH     36      13.29        2.18         .48

Equating Different Tests

April 1975
  Total Tests
    Easy                  FT2DE     54       9.34        2.89         .46
    Medium                FT2DM     54      11.34        2.80         .46
    Hard                  FT2DH     54      13.26        2.95         .44
  Anchor Test
    Internal              FE2PA     20      11.31        3.26         .46

November 1975
  Total Tests
    Easy                  FT4DE     54       9.59        2.94         .50
    Medium                FT4DM     54      11.69        2.49         .46
    Hard                  FT4DH     54      13.51        2.77         .45
  Anchor Test
    Internal              FE4PA     20      11.58        3.05         .47

*Based on 84 of the 85 items.

The standardized weighted mean square difference or total error is equal to
the variance of the difference plus the squared bias, that is,

$$\frac{\sum_j f_j d_j^2}{n s_t^2} = \frac{\sum_j f_j (d_j - \bar{d})^2}{n s_t^2} + \frac{\bar{d}^2}{s_t^2} , \qquad [2]$$

or Total Error = Variance of Difference + Squared Bias,

where

$d_j = t'_j - t_j$;

$t'_j$ = the estimated criterion score for raw score $x_j$;

$t_j$ = the criterion score for $x_j$;

$\bar{d} = \sum_j f_j d_j / n$;

$s_t$ = the standard deviation of the criterion scores $t$;

$f_j$ = the frequency of $x_j$;

$n = \sum_j f_j$;

and the summation was over that range of $x$ for which extrapolation was
unnecessary for any of the models studied. If the ratio of the squared bias to
the total error is 1, then the criterion line and the conversion line are
parallel. If the ratio is less than 1, then there is an interaction between the
model and the criterion.
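The decomposition of total error into variance of difference and squared bias
can be sketched as follows. This is an illustration under stated assumptions
rather than the procedure actually used: the function name
`discrepancy_indices` is hypothetical, the three-point data are invented for
the example, and the criterion standard deviation is taken to be the
frequency-weighted standard deviation of the criterion scores over the
summation range, which the text does not specify.

```python
def discrepancy_indices(freqs, estimated, criterion):
    """Return (total_error, variance_of_difference, squared_bias).

    All three indices are divided by the variance of the criterion
    scores, so total_error equals the sum of the other two.
    """
    n = sum(freqs)
    diffs = [t_est - t for t_est, t in zip(estimated, criterion)]
    mean_diff = sum(f * d for f, d in zip(freqs, diffs)) / n
    # Assumption: frequency-weighted variance of the criterion scores.
    mean_crit = sum(f * t for f, t in zip(freqs, criterion)) / n
    var_crit = sum(f * (t - mean_crit) ** 2
                   for f, t in zip(freqs, criterion)) / n
    total = sum(f * d ** 2 for f, d in zip(freqs, diffs)) / (n * var_crit)
    variance = sum(f * (d - mean_diff) ** 2
                   for f, d in zip(freqs, diffs)) / (n * var_crit)
    squared_bias = mean_diff ** 2 / var_crit
    return total, variance, squared_bias


# Invented example: every estimated score is exactly 1 point too high, so the
# conversion line is parallel to the criterion line and all error is bias.
freqs = [1, 2, 1]
criterion = [10.0, 20.0, 30.0]
estimated = [11.0, 21.0, 31.0]
total, variance, squared_bias = discrepancy_indices(freqs, estimated, criterion)
```

In this example the variance of the difference is zero and the ratio of
squared bias to total error is 1, the parallel-lines case described above.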


Criterion Equatings


In the case in which a test was equated to itself, the criterion for the
various equatings was the test score itself. The new and old forms were treated
as different tests, when in reality they were one and the same test. The ideal
equating would reproduce the score on the old form exactly; that is, the
conversions from raw to scaled scores would be the same for the new form as for
the old form. The criterion was not so simply established in the case in which
a test was equated to a different test. In these instances, it was necessary to
calculate "true" conversions by equating the tests in as ideal a manner as
possible.

The  criterion  equatings  were  accomplished  using  data  from  all  of  the  cases 
in  the  two  base  samples  (No.  32  and  No.  44)  from  which  subsamples  were  selected. 
Since  all  4,731  cases  in  each  base  sample  had  scores  on  the  tests  being  equated, 
it  was  possible  to  equate  the  scores  using  a  single  sample,  an  ideal  equating 
situation.  The  scores  could  be  linked  directly  without  involving  an  anchor 
test. 


Two equating methods were used to establish the two criteria against which
to compare the results of the experimental equatings: equipercentile equating of
estimated true scores derived from the 3-parameter logistic test model (the ICC
equipercentile criterion) and equipercentile equating of observed scores (the
direct equipercentile criterion). Although the criterion equatings using
estimated true scores were based on ICC methodology, the method was different
from that used in the experimental equatings. Nevertheless, the ICC
equipercentile criterion could be biased in favor of ICC equating methods. Thus,
the direct equipercentile criterion was also used. This criterion might be
biased in favor of equipercentile, but not ICC, methods. For the study of the
curvilinear methods reported here, both methods were appropriate, since the
total tests were equal in length and yielded scores with nearly equal
reliabilities. (True score equating will generally yield better conversions
than observed score equating when the new and old form scores have unequal
reliabilities.)
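The idea behind direct equipercentile equating in a single sample can be
sketched briefly: a new-form score is converted to the old-form score that has
the same percentile rank. The sketch below is an illustration only. The
mid-percentile-rank convention, the linear interpolation between observed
old-form scores, the absence of smoothing, and the function names are all
assumptions; the text does not describe these details of the criterion
equatings.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(sorted_scores, x):
    """Percent of scores below x plus half the percent exactly at x."""
    below = bisect_left(sorted_scores, x)
    at_or_below = bisect_right(sorted_scores, x)
    return 100.0 * (below + at_or_below) / (2.0 * len(sorted_scores))


def equipercentile_equate(new_scores, old_scores, x):
    """Old-form score with the same percentile rank as new-form score x."""
    new_sorted, old_sorted = sorted(new_scores), sorted(old_scores)
    target = percentile_rank(new_sorted, x)
    points = sorted(set(old_sorted))
    ranks = [percentile_rank(old_sorted, y) for y in points]
    # No extrapolation: clamp to the lowest and highest observed scores.
    if target <= ranks[0]:
        return points[0]
    if target >= ranks[-1]:
        return points[-1]
    hi = bisect_left(ranks, target)  # first index with ranks[hi] >= target
    lo = hi - 1
    weight = (target - ranks[lo]) / (ranks[hi] - ranks[lo])
    return points[lo] + weight * (points[hi] - points[lo])


# Invented example: the old form runs exactly 5 points higher for everyone.
new_form = [1, 2, 3, 4, 5]
old_form = [6, 7, 8, 9, 10]
equated_middle = equipercentile_equate(new_form, old_form, 3)
```

With these invented scores a new-form score of 3 converts to an old-form score
of 8, as expected when the two forms differ by a constant.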
Table 5

Formula Score Means, Standard Deviations, and Correlations(a) Between
Anchor Test and Total Test Scores for Base Sample No. 32 (April 1975)

Samples(b): Random 321XX, 322XX; Dissimilar 327XX, 328XX, 329XX

                  No. of
Test              Items   Statistic   Values

Equating a Test to Itself
  Total Test
    FT2XX           85    Mean        34.47  34.67  32.95  40.28
                          SD          15.48  15.57  14.47  14.82
  Anchor Test
    FE2DE           34    Mean        19.91  19.96  19.43  22.27
                          SD           6.78   6.72   6.46   6.00
                          r_xv          .93    .93    .92    .92
    FE2DM           34    Mean        14.00  14.12  13.39  16.56
                          SD           6.98   6.98   6.58   6.66
                          r_xv          .94    .94    .93    .94
    FE2DH           34    Mean         9.15   9.27   8.45  11.83
                          SD           7.35   7.36   6.90   7.37
                          r_xv          .94    .94    .93    .95
    FE2FM           40    Mean        16.99  16.63  16.01  19.52
                          SD           8.06   8.06   7.78   7.86
                          r_xv          .87    .87    .85    .86

Equating a Test to a Different Test
  Total Test
    FT4DE           54    Mean        31.61  27.94
                          SD          10.17  10.20
                          r_xv          .88    .86
    FT4DM           54    Mean        21.26  21.14  21.47
                          SD          11.67  11.08  10.99
                          r_xv          .90    .89    .89
    FT4DH           54    Mean        11.89  16.17
                          SD           9.53  10.89
                          r_xv          .89    .89
  Anchor Test
    FE2PA           20    Mean         8.46   8.42   7.25   8.02   9.72
                          SD           4.15   4.17   3.97   3.99   4.07

(a) Correlations are between the indicated anchor test score and total test
score FT2XX in the case of a test being equated to itself, and between anchor
test score FE2PA and the indicated total test score in the case of a test being
equated to a different test. All anchor test scores were included in the total
test score except FE2FM.

(b) The last two digits of the sample number refer to the total test and
anchor test score combination used for a particular equating.
Table 6

Formula Score Means, Standard Deviations, and Correlations(a) Between
Anchor Test and Total Test Scores for Base Sample No. 44 (November 1975)

Samples(b): Random 441XX, 442XX; Dissimilar 447XX, 448XX, 449XX

                  No. of
Test              Items   Statistic   Values

Equating a Test to Itself
  Total Test
    FT4XX           85    Mean        36.34  35.97  36.70  42.18
                          SD          15.77  14.92  14.74  15.10
  Anchor Test
    FE4DE           36    Mean        21.82  21.74  22.15  24.19
                          SD           7.13   7.11   6.76   6.43
                          r_xv          .93    .93    .93    .92
    FE4DM           36    Mean        15.31  15.23  15.47  17.81
                          SD           7.13   6.85   6.75   6.89
                          r_xv          .95    .95    .94    .95
    FE4DH           36    Mean         9.19   8.98   9.12  12.02
                          SD           7.85   7.29   7.56   8.06
                          r_xv          .94    .93    .93    .95
    FE4FG           40    Mean        14.38  14.15  14.57  17.22
                          SD           6.32   8.02   7.80   8.24
                          r_xv          .87    .86    .85    .87

Equating a Test to a Different Test
  Total Test
    FT2DE           54    Mean        31.88  28.61
                          SD          10.05  10.25
                          r_xv          .89    .87
    FT2DM           54    Mean        21.99  22.07  20.85
                          SD          10.95  10.92  10.38
                          r_xv          .91    .90    .89
    FT2DH           54    Mean        12.94  16.47
                          SD          10.21  10.45
                          r_xv          .89    .89
  Anchor Test
    FE4PA           20    Mean         8.12   7.93   6.66   8.08   9.48
                          SD           4.36   4.13   3.89   4.07   4.39

(a) Correlations are between the indicated anchor test score and total test
score FT4XX in the case of a test being equated to itself, and between anchor
test score FE4PA and the indicated total test score in the case of a test being
equated to a different test. All anchor test scores were included in the total
test score except FE4FG.

(b) The last two digits of the sample number refer to the total test and
anchor test score combination used for a particular equating.




To determine the ICC equipercentile criterion, the 3-parameter logistic
test model was applied separately to Reading (reading comprehension and sentence
completion) and Vocabulary (antonym and analogy) items, so that
unidimensionality did not have to be assumed across all item types. The
following scores were equated:


Base Sample No. 32          Base Sample No. 44
New Form    Old Form        New Form    Old Form

FT2DE       FT2DM           FT4DE       FT4DM
FT2DE       FT2DH           FT4DE       FT4DH
FT2DM       FT2DH           FT4DM       FT4DH


The  following  steps  were  required  to  accomplish  the  3-parameter  logistic 
ICC  criterion  equatings: 


1.  LOGIST  was  used  to  calculate  item  parameter  estimates  and  examinee 
ability  estimates  separately  for  each  set  of  relatively  homogeneous 
items  (Reading  and  Vocabulary). 


2.  For each of the three scores, for each examinee, an estimated true raw
($R - W/4$) score was calculated across the item types represented in each
score. The true raw score ($R_a$) was estimated as follows:

$$R_a = \sum_i P_i(\theta_{ia}) - \Big[ \sum_i Q_i(\theta_{ia}) \Big] / 4 \qquad [3]$$

where

$\theta_{ia}$ is the estimated ability of examinee $a$ on the item type
represented by item $i$ (if item $i$ is a Reading item, then
$\theta_{ia}$ is the estimate of the examinee's reading ability, and so
forth);

$P_i(\theta_{ia})$ is the probability of examinee $a$ answering item $i$
correctly (as calculated from the 3-parameter logistic test model);

$Q_i(\theta_{ia}) = 1 - P_i(\theta_{ia})$; and

1/4 is the correction factor for guessing for five-choice multiple-choice
items.

(Each summation was over only those items to which examinee $a$ actually
responded.)


3.  For each of the three pairs of scores, the scores were directly equated
by the equipercentile method. Raw and scaled score equivalents were
generated for each integral score on the new form for which it was
possible to establish a conversion. (The true raw scores did not usually
extend over the possible score range, thus making it impossible to
establish a conversion for some scores.)
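The estimated true raw score of step 2 can be sketched as follows. This is a
minimal illustration, not the LOGIST computation itself: the item parameters
and abilities are invented, the function names are hypothetical, and the
scaling constant D = 1.7 in the logistic function is an assumption that the
text does not state.

```python
from math import exp

D = 1.7  # assumed scaling constant; not given in the text


def p_correct(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))


def true_formula_score(items, abilities):
    """Estimated true formula score: sum of P minus one-fourth the sum of Q.

    items: (item_type, a, b, c) tuples for the items the examinee attempted.
    abilities: maps item_type to the examinee's ability estimate for that
    type, so Reading and Vocabulary items use separate abilities.
    """
    probs = [p_correct(abilities[kind], a, b, c) for kind, a, b, c in items]
    return sum(probs) - sum(1.0 - p for p in probs) / 4.0


# Invented example: one Reading and one Vocabulary item, each with difficulty
# equal to the examinee's ability on that item type and a guessing parameter
# of .20, so P = .60 and each item contributes .60 - .40/4 = .50.
items = [("reading", 1.0, 0.5, 0.2), ("vocabulary", 1.2, -0.3, 0.2)]
abilities = {"reading": 0.5, "vocabulary": -0.3}
score = true_formula_score(items, abilities)
```

The same function shows why the correction is 1/4 for five-choice items: an
examinee whose probability of success is at the chance level of .20 on every
item has an expected formula score of zero.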
RESULTS AND DISCUSSION

Comparisons of Equating Models

Tables 7 through 12 and Figures 1 through 6 summarize discrepancies between
the results of the experimental equatings and the criterion equatings. The
discrepancies are stated as mean square error and squared bias. To make the
results comparable for different tests, the discrepancies are expressed on a
scale on which the standard deviation of the criterion scores would be 100 for
a standard reference group. The discrepancy indices for Tables 9 and 11 and
Figures 3 and 5 were calculated in relation to criterion equatings in which the
3-parameter logistic test model was used (the ICC equipercentile criterion).
The discrepancy indices for Tables 10 and 12 and Figures 4 and 6 were
calculated in relation to criterion equatings in which the equipercentile
equating method was used (the direct equipercentile criterion).

The  test  variations  represented  in  Tables  7  through  12  are  characterized  in 
Tables  2  and  3.  The  statistical  characteristics  of  the  tests  on  which  the  total 
test  scores  and  the  anchor  test  scores  were  based  are  described  in  Table  4.  The 
first  three  digits  of  the  numeric  codes  in  the  column  labeled  "Sample  Number" 
identify  the  subsamples  on  which  the  equatings  were  based  (see  Tables  5  and  6 
for  sample  statistics). 

The column labeled "Best Lin" identifies the linear model that had the
smallest mean square error under the condition specified. The linear models
considered are enumerated in Sections I (observed score models) and III (true
score models) of Table 1, which also gives the identification codes for all
models. The models identified in Tables 7 through 12 as Equi% (Dir), Equi% (FE),
3-Par ICC, and 1-Par ICC correspond, respectively, to Sections IIB, IIA, IVA2,
and IVA1 of Table 1.

Figures 1 through 6 correspond sequentially to Tables 7 through 12. It is
essential to note that the size of the scale units represented on the y-axis is
different in different figures. The scale on the left side of each figure is in
terms of standard deviation units and is the square root of the linear scale of
mean square error shown on the right side of the figure. Contiguous bars
represent the results for one model or, in the case of "Best Lin," for the best
of a set of models. The results for random samples are shown in the top half of
the figure; and the results for dissimilar samples, in the bottom half. The
outer bars represent total error; and the inner shaded bars, bias.

The  bars  in  each  set,  from  left  to  right,  appear  in  the  successive  rows  in 
the  comparable  area  of  the  corresponding  table.  For  example,  the  first  bar  in 
each  set  in  Figure  1  represents  the  results  obtained  with  an  internal  anchor 
test  for  a  pair  of  April  1975  samples,  which  were  either  random  or  of  middle  and 
high  ability,  respectively;  the  second  bar,  an  external  anchor  test  for  the  same 
April  samples;  the  third  bar,  an  internal  anchor  test  for  a  pair  of  November 
1975  samples;  and  the  fourth  bar,  an  external  anchor  test  for  the  same  November 
samples. 
Equating a Test to Itself

Test Variations 1 and 4. The data in Table 7 (and Figure 1) show that for
both similar and dissimilar samples the best linear model had the smallest
amount of total error across replications, followed by the 1-parameter ICC and
3-parameter ICC models. The direct equipercentile and frequency estimation
equipercentile methods had relatively more error. Surprisingly, the total error
for dissimilar samples was very similar to the total error for random samples,
implying, perhaps, that if the anchor test is nearly parallel to the total
tests, the differences between samples are not so important. The only method
showing noticeably more error for dissimilar samples was the frequency
estimation equipercentile method. In this instance, bias accounted for a large
proportion of the error. For dissimilar samples, the equatings through an
external anchor had noticeably more error for all but the 1-parameter ICC model.


Figure 1

Comparisons of Equating Models: Equating a Test
to Itself Through a Medium-Difficulty Anchor Test

[Bar groups, left to right: Best Lin, Equi% (Dir), Equi% (FE), 3-Par ICC,
1-Par ICC]

Test Variation 5. For random samples the total errors for Test Variation 5
(Table 8 and Figure 2) were similar to the total errors for Test Variations 1
and 4. The fact that the total error for the best linear model was small for
random samples suggests that the curvilinear relation of the anchor test and the
total test had little effect on the equating.

The total error for the best linear model was substantially larger when
dissimilar samples were used. The ICC models were noticeably superior to the
other models in this case, suggesting that these models are relatively robust
when the anchor test is different in difficulty from the total test and the
samples differ in ability.


Figure  2 

Comparisons  of  Equating  Models:  Equating  a  Test 
to  Itself  Through  an  Easy  or  Difficult  Anchor  Test 
Equating a Test to a Different Test

Test Variations 8 and 9. The introduction of some curvilinearity in the
relation between the scores of the tests being equated (due to differences in
difficulty) resulted, as expected, in a very large total error for the best
linear model (see Tables 9 through 12 and Figures 3 to 6). For the most part,
sample variation seemed to have little effect on total error for the
curvilinear models. All of the curvilinear models had noticeably less error
than the best linear model, but the 1-parameter ICC model had substantially
more error than the other curvilinear models. For most equatings, the model
with the smallest total error was the 3-parameter ICC model. For the most part,
when greater differences in total test difficulty were introduced (greater
curvilinearity), the total error tended to increase substantially for all
models, becoming exceedingly large for the best linear model.


Figure  3 

Comparisons  of  Equating  Models:  Equating  a  Test  to 
a  Different  Test  Using  Easy  and  Medium  or  Medium  and 
Difficult  Tests  (ICC  Equipercentile  Criterion) 


Criterion  Bias 


Equating  a  Test  to  a  Different  Test 

The total error and bias for the two criteria for Test Variations 8 and 9
can be compared in Figures 3 and 4 (Tables 9 and 10) and in Figures 5 and 6
(Tables 11 and 12). It may be noted that for random samples the experimental
equipercentile equatings had less error than the 3-parameter ICC model when the
equipercentile observed-score criterion was used, suggesting that the
equipercentile criterion equatings based on true scores may be biased in favor
of the 3-parameter ICC model, just as the equipercentile criterion equatings
based on observed scores seem to be biased in favor of the equipercentile
models. Interestingly enough, the 3-parameter ICC model had less total error
than the 1-parameter ICC model regardless of which criterion was used, although
the difference between the models was smaller for the direct equipercentile
criterion.

The criterion bias was less obvious in the case of dissimilar samples; in
fact, the rank ordering of the models in terms of total error was not affected
by the choice of criterion. The 3-parameter ICC model had the smallest error
under both criteria, but the size of the error decreased for the equipercentile
and 1-parameter ICC models and increased for the 3-parameter ICC model when the
direct equipercentile criterion was used.


Figure 4

Comparisons  of  Equating  Models:  Equating  a  Test  to  a 
Different  Test  Using  Easy  and  Medium  or  Medium  and 
Difficult  Total  Tests  (Direct  Equipercentile  Criterion) 
Equating a Test to Itself


The  criterion  would  seem  to  be  well  established  when  a  test  is  equated  to 
itself.  However,  it  is  possible  that  the  criterion  in  this  case  is  biased  in 
favor  of  a  model  that  comes  closest  to  fixing  all  of  the  item  parameters  at  the 
same  values,  in  this  case,  the  1-parameter  ICC  model. 
It is easily seen that if the $a$'s, $b$'s, and $c$'s in the 3-parameter
logistic test model are fixed at some constant values, the true raw scores for
the old and new forms corresponding to a given ability level have to be the
same, since the probability of getting a particular item correct is then a
function of only the item parameters and ability.

In the 1-parameter logistic test model, $c$ is fixed at 0 and $a$ at 1 for all
items. Since only the $b$'s are estimated, it might be expected that the
1-parameter ICC model is more likely than the 3-parameter ICC model to yield the
appropriate conversions. Is it of any consequence, then, if it is found that the
1-parameter ICC model seems to be superior to the 3-parameter ICC model in
equating a test to itself? What is really desired is to know which of the models
works best when a test is equated to a different test, particularly when a test
is equated to a parallel test. It would seem, then, that there may be a natural
bias in favor of the 1-parameter ICC model when a test is equated to itself, but
this probably does not affect the other models.
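
For reference, the relation between the two models in this argument can be
written out. The parameterization below is the usual one for the 3-parameter
logistic model; the scaling constant $D$ (commonly 1.7) is an assumption, since
the text does not state it. Fixing $c_i = 0$ and $a_i = 1$ for every item leaves
the item difficulty $b_i$ as the only item parameter to be estimated.

```latex
% 3-parameter logistic model for item i at ability \theta
P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\left[-D a_i (\theta - b_i)\right]}

% 1-parameter special case: c_i = 0 and a_i = 1 for all items
P_i(\theta) = \frac{1}{1 + \exp\left[-D (\theta - b_i)\right]}
```

Because a test equated to itself has identical item parameters on the "old" and
"new" forms, the model with the fewest free item parameters has the least
opportunity to estimate them differently on the two forms.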

Figure  5 

Comparisons  of  Equating  Models:  Equating  a 
Test  to  a  Different  Test  Using  Easy  and  Difficult 
Total  Test  (ICC  Equipercentile  Criterion) 
Figure 6

Comparisons  of  Equating  Models:  Equating  a 
Test  to  a  Different  Test  Using  Easy  and  Difficult 
Total  Test  (Direct  Equipercentile  Criterion) 


CONCLUSIONS 

Several  conclusions  can  be  drawn  from  the  study.  They  should  be  considered 
tentative  because  of  possible  criterion  bias,  which  has  been  only  partially  con¬ 
trolled;  because  it  was  not  possible  to  study  all  test  variations  in  the  full 
design,  in  particular,  the  variations  in  which  parallel,  but  not  identical, 
tests  were  equated;  and  because  the  results  showed  occasional  inconsistencies 
that  have  not  yet  been  explained: 

1.  When  a  test  is  equated  to  itself  (or,  to  generalize,  to  a  test  like 
itself)  through  a  parallel  anchor  test,  a  linear  model  yields  very  good 
results  regardless  of  the  type  of  samples  used.  However,  whether  any 
particular  linear  model  consistently  gives  satisfactory  results  re¬ 
quires  further  study.  The  best  linear  model  for  a  particular  test 
variation  and  type  of  sample  was  used  here. 

2.  Curvilinear  models  give  results  nearly  comparable  to  those  of  the  best 
linear  model  when  a  test  is  equated  to  a  test  like  itself  through  an 
internal  anchor.  When  an  external  anchor  is  used,  of  the  curvilinear 
models  the  ICC  models  (particularly  the  1-parameter  model)  give  rela- 


Lively  better  results.  However,  the  criterion  may  be  biased  in  favor 
of  the  1-parameter  model. 

3.  The  types  of  samples  have  a  relatively  small  and  unsystematic  effect  on 
the  quality  of  the  equating  results  if  the  anchor  test  is  similar  in 
content  and  in  difficulty  to  the  total  tests.  The  one  exception  is  the 
frequency  estimation  procedure,  which  seems  not  to  perform  well  when  an 
external  anchor  and  dissimilar  samples  are  used. 

4. The equatings involving an internal anchor have less total error than
comparable equatings with an external anchor. Whether or not this
inconsistency would obtain when a test is equated to a test of different
difficulty needs to be studied. The possibility that the criterion is
biased in favor of an internal anchor also needs investigation. However,
it may simply be due to the fact that the external anchor tests were not
quite as similar to the total tests in content and statistical
characteristics as were the internal anchor tests.

5. When a test is equated to a test like itself through an easy or hard
anchor test with random samples, all of the models have a small mean
square error. When dissimilar samples are used, however, the ICC models
give clearly superior results.

6. When total tests differ considerably in difficulty, linear models yield
unsatisfactory results in that the mean square error becomes very
large; but they tend to yield better estimates of the mean than the
curvilinear models. The 1-parameter ICC model and the frequency
estimation method also give unacceptable results in many instances. This
result is consistent with the Slinde and Linn (1978) findings that the
Rasch model yielded poor results for vertical equating.

7.  The  3-parameter  ICC  model  is  the  best  equating  model  when  total  tests 
of  unequal  difficulty  are  equated  through  a  medium-difficulty  anchor 
test  with  dissimilar  samples.  It  would  appear  that  the  3-parameter 
model  is  the  equating  model  most  likely  to  yield  acceptable  results 
under  unusual  or  extreme  conditions. 

This study, though comprehensive in its coverage, is limited in that SAT-V
items made up the item pool. These items are known to be relatively homogeneous
and somewhat difficult for the current test-taking population. Similar studies
are needed in situations where the content of the anchor test is allowed to
depart by varying degrees from the content of the total tests and where the
test-taking population is from a different age group.
The Effects of Context on Latent-Trait Model Item
Parameter  and  Trait  Estimates 


Wendy  M.  Yen 
CTB/McGraw  Hill 


Latent  trait  models  hold  the  promise  of  being  particularly  useful  in  the 
development  of  test  item  pools.  Items  for  an  item  pool  can  be  accumulated  by 
administering  different  sets  of  items  to  different  groups  of  examinees;  all  the 
items'  parameters  can  then  be  linked  to  the  same  scale  through  a  common  subset 
of items called anchor or linking items. After an item pool is created,
different examinees can take different items from the pool, and all examinees'
trait estimates should be on a common scale. An examinee's trait estimate
should not be systematically affected by the choice of items, although
unsystematic effects, as reflected in the standard error of measurement, can
occur.

The  successful  use  of  latent  trait  models  in  the  development  and  use  of 
item pools relies on (1) the lack of systematic effects on item parameter
estimates when they are obtained in different contexts with different examinees
and (2) the lack of systematic effects on trait estimates when they are
obtained with different items. Previous research on the invariance of item
parameter estimates has typically examined the effects of changing samples on a
fixed group of items and has not examined variations in the identities and
arrangements of the items. The present research examines such variations, as
well as the effects of changes in items on trait estimates. The effects of the
choice of latent trait model on parameter stability and trait equating are also
examined.


Method 


Construction  of  Test  Booklets 


Items  were  chosen  from  the  California  Achievement  Tests,  Forms  C  and  D 
(1977), Level 14, Reading Comprehension (Reading, 80 items), and Level 16,
Mathematics Concepts and Applications (Mathematics, 90 items). The Reading
items all had four answer choices and the Mathematics items all had five answer
choices. Preliminary analyses were performed on data for students who took both
test  forms  (N  =  294  for  Reading,  N  =  379  for  Mathematics).  Chi-square  goodness- 
of-fit  statistics  were  used  to  evaluate  the  items  in  terms  of  their  fit  to  a 
2-parameter  logistic  model;  item  difficulties  and  discriminations  were  reviewed 
to  identify  items  with  extreme  difficulties  or  discriminations. 

Using  these  results,  five  sets  of  items  were  created  in  each  content  area. 
The first set of items (Set A) had a range of difficulties and relatively good
model fit; these items were used as anchor items for linking item parameters
obtained in different booklets. The second and third sets of items (Sets V and
W) had relatively poor fit and, in some cases, extreme difficulties or low
discriminations; these items were included to alter item contexts. The fourth
and fifth sets of items (Sets X and Y) had relatively good model fit and
discrimination and nonextreme difficulties, and were the items of major
interest. Table 1 contains the number of items chosen for each set for each
content area.


Table 1

Number of Items in Each Set

           Number of Items
Set     Reading     Mathematics
A         10            11
V         10            11
W         10            11
X         20            22
Y         20            22

Using  these  sets  of  items,  seven  booklets  were  created,  as  described  in 
Table  2.  The  items  in  the  different  sets  were  intermingled  within  the  booklets, 
and  the  sequences  of  items  were  varied  over  booklets.  Because  of  the  connection 
of  Reading  items  to  passages  and  the  connection  of  some  Mathematics  items  to 
graphs, there was necessarily some similarity over booklets in the local
contexts for some items. The sequence of answer choices for an item was held
constant over all booklets.


Table 2

Composition of Test Booklets

                       Number of Items
Booklet   Sets      Reading     Mathematics
1         X+Y         40            44
2         A+V+X       40            44
3         A+W+Y       40            44
4         A+X         30            33
5         A+Y         30            33
6         A+X         30            33
7         A+Y         30            33

To indicate the degree of similarity of the sequence of X and Y items
across booklets, the sequential positions of the X items and of the Y items
were determined. Spearman rank-order correlations between items' positions in
Booklet 1 and their positions in the other booklets are contained in Table 3.
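
The position comparison described above can be illustrated with a short sketch. It assumes that each item occupies a distinct sequential position in each booklet (so there are no tied ranks) and uses the standard no-ties Spearman formula; the function name and the example positions are hypothetical and are not taken from the study.

```python
def spearman_no_ties(pos_a, pos_b):
    """Spearman rank-order correlation between two lists of item positions.

    Assumes no ties within either list, which holds for sequential
    positions of distinct items inside one booklet.
    """
    n = len(pos_a)
    # Convert raw positions to ranks 1..n within each booklet.
    rank_a = {p: r for r, p in enumerate(sorted(pos_a), start=1)}
    rank_b = {p: r for r, p in enumerate(sorted(pos_b), start=1)}
    d2 = sum((rank_a[x] - rank_b[y]) ** 2 for x, y in zip(pos_a, pos_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical positions of four X items in Booklet 1 and in another booklet.
booklet_1_positions = [1, 3, 5, 7]
other_positions = [2, 9, 4, 12]
rho = spearman_no_ties(booklet_1_positions, other_positions)
```

A value near 1 indicates that the items appear in nearly the same order in both booklets; a value near 0 indicates that the ordering was effectively scrambled.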
Table 3

Spearman Rank-Order Correlations Between the Position
of X or Y Items in Booklet 1 and the Position
of Those Items in Booklets 2 to 7

                  Correlation with Booklet 1
Booklet   Set     Reading     Mathematics
2         X         .90           .05
3         Y         .56           .40
4         X         .59          -.06
5         Y         .70           .62
6         X         .10          -.06
7         Y         .30           .03

Note. Each correlation is based on 20 items for Reading
      and 22 items for Mathematics.


Test Administration


Students  were  tested  in  Grade  4  for  Reading  and  in  Grade  6  for  Mathematics. 
Time  limits  for  test  administration  were  adjusted  for  the  length  of  the  booklet 
and were made comparable (on a time-per-item basis) to those given in the
California Achievement Tests Examiner's Manual, Levels 14-19, Forms C and D
(1977).
For  the  first  testing  each  student  took  one  of  the  seven  booklets.  Booklets  2 
and  3  (as  well  as  Booklets  4  and  5  and  Booklets  6  and  7)  were  administered  to 
students  in  the  same  classrooms  on  an  alternate-seat  basis.  (This  alternate- 
seat  testing  was  done  for  a  study  of  equipercentile  equating  that  will  not  be 
reported here.) Two weeks later, all students took Booklet 1. The number of
examinees with usable answer sheets for both first and second testings appears
in Table 4.


Table 4

Number of Examinees with Usable Answer
Sheets for Both First and Second Testings

First-Testing     Reading       Mathematics
Booklet*          (Grade 4)     (Grade 6)
1                   470            450
2                   225            228
3                   216            232
4                   193            230
5                   198            219
6                   186            232
7                   190            221

Note. Booklets 2 and 3, 4 and 5, and 6 and 7 were administered
      on an alternate-seat basis.

*All second testings used Booklet 1.
Latent Trait Models


Two  latent  trait  models  were  used — the  3-parameter  logistic  model  and  the 
1-parameter (Rasch) logistic model (see Allen & Yen, 1979, for a further
description of these models). For the 3-parameter logistic model, the item
characteristic function for the ith item and the kth examinee is

    P_i(\theta_k) = c_i + \frac{1 - c_i}{1 + e^{-1.7 a_i (\theta_k - b_i)}}     [1]

where θ_k is the latent trait value for examinee k and a_i, b_i, and c_i are the
discriminating power, difficulty, and lower asymptote, respectively, for item i.
The item characteristic function for the 1-parameter model is

    P_i(\theta_k) = \frac{1}{1 + e^{-1.7 a (\theta_k - b_i)}}     [2]

where a is the item discrimination power common to all the items.
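
As an illustration of Equations 1 and 2, the two item characteristic functions can be written as a minimal sketch. The function names, the default common discrimination of 1.0, and the example parameter values are assumptions made here for illustration; they are not part of the LOGIST program or of the study's data.

```python
import math

D = 1.7  # scaling constant that appears in Equations 1 and 2


def icc_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Equation 1: probability of passing an item under the 3-parameter model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def icc_1pl(theta: float, b: float, a_common: float = 1.0) -> float:
    """Equation 2: 1-parameter model; a_common is shared by all items."""
    return 1.0 / (1.0 + math.exp(-D * a_common * (theta - b)))


# Illustrative values only: at theta == b the 3-parameter curve sits halfway
# between the lower asymptote c and 1, and the 1-parameter curve is at .50.
p3 = icc_3pl(theta=0.0, a=1.0, b=0.0, c=0.20)
p1 = icc_1pl(theta=0.0, b=0.0)
```

The lower asymptote c is what lets the 3-parameter curve level off above zero for very low trait values, which matters for the low-trait results discussed later in the paper.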

To obtain latent trait and item parameter estimates, examinees' item
response vectors were analyzed using the LOGIST computer program provided by
Wood, Wingersky, and Lord (1976). Trait estimates were not obtained for
examinees who had zero or perfect scores, and using a default option of the
Wood et al. (1976) program, trait estimates also were not obtained for
examinees who did not answer at least a third of the items being scaled.


Item  Linking 

The  item  parameters  for  the  X  and  Y  item  subsets  were  placed  on  the  same 
scale  using  different  procedures  for  Booklet  1  and  Booklets  2  to  7.  For  Booklet 
1,  examinees  took  both  X  and  Y  items;  item  response  vectors  for  the  two  sets 
were  analyzed  together,  and  their  item  parameters  were  automatically  placed  on 
the  same  scale.  This  procedure  was  followed  whether  Booklet  1  was  administered 
in the first testing or in the second testing. In some analyses, those
examinees who took Booklet 1 at the first testing were divided into two
approximately equal-sized groups, and item parameters were obtained separately
in the two groups. For convenience, these groups were labeled Booklets 1A and
1B, even though the 1A and 1B test booklets were the same; "A" and "B" merely
indicated different samples of examinees.

For  Booklets  2  to  7,  booklets  were  linked  in  pairs:  2  and  3,  4  and  5,  6  and 
7.  For  example,  the  responses  to  all  the  items  in  Booklets  2  and  3  were  pooled 
and  analyzed  jointly.  This  was  done  by  treating  the  items  in  the  A,  V,  W,  X, 
and Y sets as if they were all contained in one test booklet. Examinees'
responses were used for all the items they completed, and they were given a
"not reached" code for every item they did not take. Examinees who took
Booklet 2 were given "not reached" for items in Sets W and Y, and those who
took Booklet 3 were given "not reached" for items in Sets V and X. Using the
LOGIST program, an examinee's trait value was based only on the items the
examinee actually took; not reached items were ignored. Similarly, an item's
parameters were estimated using only the responses of examinees who actually
completed the item. This joint analysis of Booklets 2 and 3 placed the
parameters for the A, V, W, X, and Y items on the same scale.
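
The pooling of Booklets 2 and 3 into one combined item layout can be sketched as follows. This is only an illustration of the data arrangement: the set sizes are the Reading values from Table 1, the item identifiers and the use of None as the "not reached" code are hypothetical choices made here, and the actual code value used by LOGIST is not given in the text.

```python
NOT_REACHED = None  # hypothetical placeholder for the "not reached" code

SET_SIZES = {"A": 10, "V": 10, "W": 10, "X": 20, "Y": 20}   # Reading, Table 1
BOOKLET_SETS = {2: ("A", "V", "X"), 3: ("A", "W", "Y")}      # Table 2

# One combined layout covering every item in Sets A, V, W, X, and Y.
ITEM_IDS = [f"{s}{i + 1}" for s in SET_SIZES for i in range(SET_SIZES[s])]


def pooled_vector(booklet: int, answers: dict) -> list:
    """Return a response vector over all pooled items for one examinee.

    answers maps item id -> 0/1 for every item in the examinee's booklet;
    items from sets the booklet does not contain are coded NOT_REACHED.
    """
    taken = set(BOOKLET_SETS[booklet])
    return [answers[item] if item[0] in taken else NOT_REACHED
            for item in ITEM_IDS]


# Hypothetical examinee who took Booklet 2 and passed every item.
answers_b2 = {item: 1 for item in ITEM_IDS if item[0] in "AVX"}
vector_b2 = pooled_vector(2, answers_b2)
```

Because only the Set A items carry responses from both booklets, they are what ties the V and X parameters to the W and Y parameters in the joint analysis.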

For  some  analyses,  responses  for  Booklets  2  to  7  were  pooled  and  jointly 
analyzed. Thus, responses to Booklets 2, 4, and 6 entered into the estimation
of the item parameters for the X subset; responses to Booklets 3, 5, and 7
entered into the estimation of the item parameters for the Y subset; and
responses to all six booklets entered into the estimation of the item
parameters for the A subset of items.

Analysis  of  Context  Effects 


To  examine  context  effects  on  the  means  and  standard  deviations  of  the  item 
parameters  from  the  first  testing,  it  was  necessary  to  place  the  item  parameters 
estimated  from  different  samples  and  booklets  on  the  same  scale.  To  do  this, 
the  item  parameters  obtained  from  the  pooling  of  Booklets  2,  4,  6  and  3,  5,  7 
were  scaled  so  that  the  corresponding  trait  estimates  had  a  mean  of  0  and  a 
standard  deviation  of  1,  and  mean  item  difficulty  and  mean  item  discriminations 
were  obtained  from  the  Set  A  items.  The  item  parameters  from  the  other  pairs  of 
first-testing booklets that contained Set A items (i.e., Booklets 2 and 3, 4 and
5,  6  and  7)  were  linearly  transformed  so  that  their  Set  A  mean  item  difficulties 
and  mean  discriminating  powers  equaled  the  Set  A  means  obtained  from  the  pooled 
2,  4,  6  and  3,  5,  7  booklets.  This  transformation  theoretically  placed  all  the 
first-testing  item  parameters  on  the  same  scale — a  scale  that  produced  trait 
estimates  with  a  mean  of  0  and  a  standard  deviation  of  1  for  examinees  who  took 
Booklets  2  to  7.  The  X  and  Y  item  parameters  could  then  be  compared  across 
booklets  to  examine  systematic  context  effects.  (Note  that  this  comparison 
would  not  be  made  for  Booklet  1,  which  did  not  contain  Set  A  items.) 
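
The text does not give the formulas for the linear transformation, so the following is only a sketch of one way to carry it out, assuming the usual latent trait rescaling θ* = Aθ + B, under which difficulties transform as b* = Ab + B and discriminations as a* = a/A. The function names and example values are hypothetical.

```python
from statistics import mean


def anchor_link_constants(a_anchor_src, b_anchor_src, a_anchor_ref, b_anchor_ref):
    """Slope A and intercept B for theta_ref = A * theta_src + B.

    A and B are chosen so that, after rescaling, the Set A mean
    discrimination and mean difficulty equal the reference (pooled) means.
    """
    A = mean(a_anchor_src) / mean(a_anchor_ref)
    B = mean(b_anchor_ref) - A * mean(b_anchor_src)
    return A, B


def rescale_items(a, b, A, B):
    """Apply the rescaling to any set of items from the same calibration."""
    return [ai / A for ai in a], [A * bi + B for bi in b]


# Hypothetical Set A estimates from one booklet pair and from the pooled run.
a_src, b_src = [0.8, 1.2, 1.0], [-0.5, 0.0, 0.7]
a_ref, b_ref = [1.0, 1.4, 1.2], [-0.3, 0.1, 0.8]
A, B = anchor_link_constants(a_src, b_src, a_ref, b_ref)
a_new, b_new = rescale_items(a_src, b_src, A, B)
```

The same A and B would then be applied to the X and Y items from that booklet pair. The product a(θ - b) is unchanged by this rescaling, so the model's predicted probabilities are not altered.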

The square root of the mean square difference (RMSD) between two sets of
estimated statistics (e.g., item parameters or trait values) was found as

    RMSD = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (z_{mj} - z_{m'j})^2}     [3]

where z_{mj} and z_{m'j} represent statistics in sets m and m' and there are n
statistics being compared. The RMSD also can be expressed as

    RMSD = \sqrt{(s_m - s_{m'})^2 + (\bar{z}_m - \bar{z}_{m'})^2 + 2 (1 - r_{mm'}) s_m s_{m'}}     [4]

where

    s_m and s_{m'} are the standard deviations of the statistics
        in the two sets,

    \bar{z}_m and \bar{z}_{m'} are the means of the statistics in the two
        sets, and

    r_{mm'} is the correlation between the two sets of statistics.
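
A short numerical sketch can show that the direct definition of the RMSD and its expression in terms of standard deviations, means, and the correlation agree. It assumes standard deviations computed with n (not n - 1) in the denominator, which is what makes the two forms identical; the function names and example values are hypothetical.

```python
import math
from statistics import mean, pstdev


def rmsd(x, y):
    """Square root of the mean squared difference between paired statistics."""
    return math.sqrt(mean([(xi - yi) ** 2 for xi, yi in zip(x, y)]))


def rmsd_decomposed(x, y):
    """Same quantity from SDs, means, and the correlation of the two sets."""
    sx, sy = pstdev(x), pstdev(y)          # population SDs (divide by n)
    mx, my = mean(x), mean(y)
    cov = mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])
    r = cov / (sx * sy)
    return math.sqrt((sx - sy) ** 2 + (mx - my) ** 2 + 2.0 * (1.0 - r) * sx * sy)


# Hypothetical difficulty estimates for the same four items in two booklets.
b_first = [0.5, -0.2, 1.1, 0.3]
b_second = [0.4, 0.0, 0.9, 0.6]
direct = rmsd(b_first, b_second)
from_parts = rmsd_decomposed(b_first, b_second)
```

The decomposed form is useful for interpretation: it separates a difference in spread, a difference in level, and a lack of linear agreement between the two sets.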
Chi-square goodness-of-fit statistics were calculated in the following
fashion. Examinees were rank ordered on the basis of their trait estimates and
then divided into 10 cells with approximately equal numbers of examinees in each
cell. The chi-square for an item was

    \chi_i^2 = \sum_{j=1}^{10} \frac{N_j (O_{ij} - E_{ij})^2}{E_{ij} (1 - E_{ij})}     [5]

where

    N_j was the number of examinees in cell j,

    O_{ij} was the observed proportion of examinees in cell j that
        passed item i, and

    E_{ij} was the proportion of examinees in cell j expected to pass
        item i:

    E_{ij} = \frac{1}{N_j} \sum_{k \in \text{cell } j} P_i(\hat{\theta}_k)     [6]

where P_i(\hat{\theta}_k) was the item characteristic function evaluated using
the trait estimate for examinee k and the estimated item parameters for item i.
When item parameters were estimated from the data on which the chi-square is
based, this chi-square statistic had 10 - 3 = 7 degrees of freedom for the
3-parameter model and 10 - 1 = 9 degrees of freedom for the 1-parameter model.¹
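
The chi-square computation described above (rank ordering examinees, forming cells, and comparing observed with expected proportions passing) can be sketched as follows. The function names, the way cell boundaries are cut, and the simulated data are assumptions made for illustration; the sketch uses the 3-parameter item characteristic function with the 1.7 scaling constant.

```python
import math
import random


def icc_3pl(theta, a, b, c, D=1.7):
    """3-parameter logistic item characteristic function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_chi_square(thetas, responses, a, b, c, n_cells=10):
    """Chi-square fit statistic for one item.

    thetas: trait estimates; responses: 0/1 scores on the item.
    Requires at least n_cells examinees so that no cell is empty.
    """
    order = sorted(range(len(thetas)), key=lambda k: thetas[k])
    n = len(order)
    chi2 = 0.0
    for j in range(n_cells):
        cell = order[j * n // n_cells:(j + 1) * n // n_cells]
        n_j = len(cell)
        o_ij = sum(responses[k] for k in cell) / n_j            # observed
        e_ij = sum(icc_3pl(thetas[k], a, b, c) for k in cell) / n_j  # expected
        chi2 += n_j * (o_ij - e_ij) ** 2 / (e_ij * (1.0 - e_ij))
    return chi2


# Simulated pseudo-examinees whose responses follow the model exactly.
rng = random.Random(0)
sim_thetas = [rng.gauss(0.0, 1.0) for _ in range(500)]
sim_resp = [1 if rng.random() < icc_3pl(t, 1.0, 0.0, 0.2) else 0
            for t in sim_thetas]
chi2_value = item_chi_square(sim_thetas, sim_resp, a=1.0, b=0.0, c=0.2)
```

In this simulated case the true trait values and item parameters are used, so nothing is estimated from the data; the degrees-of-freedom reduction described in the text applies when the parameters are estimated from the same responses.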


Results 


Item  Parameter  Estimates 


Table  5  contains  the  number  of  examinees  whose  responses  entered  into  the 
first-testing  item  parameter  estimations  for  the  various  booklets.  These  sample 
sizes  usually  were  slightly  smaller  than  those  in  Table  4  because  examinees  were 
not  used  if  they  did  not  answer  at  least  a  third  of  the  items  or  if  they  had 
zero  or  perfect  scores. 

First testing. For the 3-parameter model the lower asymptotes, c, had
homogeneous values centered at .20 for Reading and .15 for Mathematics.
Correlations of the X + Y difficulty and discrimination parameters estimated in
the different first-testing booklets are contained in Table 6.² Recall that
Booklets 1A, 1B, and 1 had the same context and differed only in terms of the
samples and sample sizes used to estimate their parameters. The correlations
between Booklets 1A and 1B can be compared to the correlations among Booklets 2
and 3, 4 and 5, and 6 and 7 to examine the degree to which changes in context
affect the stability of item parameters for sample sizes of about 200. It is
clear from the data in Table 6 that a change in context substantially decreased
the stability of all the item parameter estimates.

¹A simulation study found that when 40 item responses for 500 pseudo-examinees
were generated for a model and these responses were used to estimate traits
and item parameters for that model, the resulting chi-squares had expectations
approximately equal to their degrees of freedom.

²Note that these correlations were a function of the particular items being
correlated, and the correlations cannot be meaningfully compared across content
areas.


Table 5

Sample Sizes Involved in Parameter
Estimations for the First Testing

                            Sample Size
Booklets              Reading       Mathematics
1A                      232            225
1B                      228            224
1                       460            449
2&3                  223 + 214      228 + 232
4&5                  187 + 195      228 + 218
6&7                  183 + 184      228 + 218
2,4,6&3,5,7          593 + 593      684 + 668

Note. Some of these sample sizes are slightly smaller than the
      corresponding sample sizes in Table 4 because examinees
      were excluded if they had zero or perfect scores or if
      they did not answer at least a third of the items.

First versus second testing. Another comparison of the correlations
between item parameters was made. Item parameters were obtained for the X + Y
items using all the second-testing data. (Recall that all the second testings
used the same booklet, Booklet 1.) These parameter estimates were based on
relatively large sample sizes and therefore had fairly small standard errors.
Correlations between these second-testing parameter estimates and the
first-testing parameter estimates are contained in Table 7. Correlations
involving Booklets 1A, 1B, and 1 give information about the stability of
parameters over a constant context. Correlations involving Booklets 2 to 7 give
information about the stability of parameters over a varying context.

Booklets 1A, 1B, 2 and 3, 4 and 5, and 6 and 7 all had item parameters
estimated on the basis of about 200 examinees. For the item discriminations the
change in context produced a consistent reduction in the correlations. The
change in context also decreased the stability of the item difficulties for both
the 3-parameter and 1-parameter models. By examining the correlations for
Booklets 2, 4, 6 and 3, 5, 7, it can be seen that when data were pooled over
different contexts, and sample sizes were therefore tripled, there was an
increase in the strength of the linear relationship between first- and
second-testing item parameters. However, the correlations for Booklets 2, 4, 6
and 3, 5, 7 (which involved parameter estimates based on sample sizes of about
600) were always lower than the corresponding correlations for Booklet 1 (which
involved parameter estimates based on N ≈ 400), and usually lower than the
correlations for
Table 6

Correlations Between First-Testing Item Parameter Estimates
for 40 X+Y Reading Items (Lower Triangles)
and 44 X+Y Mathematics Items (Upper Triangles)

Model, Parameter,                        Booklets
and Booklet            1A    1B    1     2&3   4&5   6&7   2,4,6&3,5,7

3-Parameter Discrimination
  1A                   --    .78   .92   .55   .45   .54   .61
  1B                   .63   --    .96   .65   .50   .46   .63
  1                    .90   .88   --    .64   .48   .51   .64
  2&3                  .47   .52   .52   --    .41   .64   .80
  4&5                  .39   .28   .40   .31   --    .47   .77
  6&7                  .16   .04   .11   .29   .36   --    .83
  2,4,6&3,5,7          .40   .31   .39   .67   .74   .67   --

3-Parameter Difficulty
  1A                   --    .94   .98   .82   .80   .75   .84
  1B                   .87   --    .99   .87   .85   .75   .88
  1                    .96   .97   --    .87   .85   .78   .89
  2&3                  .77   .72   .76   --    .91   .72   .95
  4&5                  .83   .73   .80   .69   --    .76   .96
  6&7                  .73   .62   .69   .54   .64   --    .88
  2,4,6&3,5,7          .90   .81   .87   .88   .87   .84   --

1-Parameter Difficulty
  1A                   --    .98   .99   .88   .86   .85   .90
  1B                   .95   --    .99   .89   .88   .87   .92
  1                    .99   .99   --    .89   .87   .87   .91
  2&3                  .76   .77   .78   --    .95   .83   .97
  4&5                  .71   .65   .71   .68   --    .85   .97
  6&7                  .65   .63   .65   .55   .68   --    .93
  2,4,6&3,5,7          .81   .81   .82   .84   .88   .88   --

Booklets 1A and 1B (N ≈ 200). For Reading the 3-parameter and 1-parameter
models had similar stabilities for item difficulties; for Mathematics the
1-parameter model produced slightly more stable difficulties. It is not clear
if the difficulty parameters of one of the models were affected more by changes
in context than the parameters of the other model.

Effect  of  Linking 

The  X  and  Y  item  parameters  from  Booklets  2  to  7  were  linked  by  the  use  of 
the  anchor  items.  It  is  possible  that  there  were  inadequacies  in  the  linking 
procedure  that  caused  the  reduced  correlations  between  parameters  estimated  in 
these  booklets  and  those  in  Booklet  1.  Therefore,  a  check  on  the  importance  of 
the linking procedure in affecting item parameter correlations was made.
Correlations were computed between the item parameters obtained for Set X using
the first-testing booklets and the parameters obtained for Set X using the
second-testing booklet, Booklet 1; analogous correlations were obtained for
Set Y. There was no linking procedure influencing these item parameters. These
correlations are contained in Table 8. In the vast majority of cases, the
correlations in Table 7 fell within the range of correlations in Table 8
produced by the X items and the Y items correlated separately. These results
suggest that the linking procedure did not cause the reduction in item
parameter correlations from Booklet 1 to Booklets 2 to 7.


Table 7

Correlations of First-Testing Item Parameters
with Second-Testing Item Parameters
for 40 X+Y Reading Items and 44 X+Y Mathematics Items

First-Testing                              Difficulty
Booklets           Discrimination   3-Parameter   1-Parameter

Reading
  1A                    .76             .95           .92
  1B                    .72             .84           .89
  1                     .81             .92           .92
  2&3                   .63             .76           .82
  4&5                   .60             .81           .81
  6&7                   .30             .73           .73
  2,4,6&3,5,7           .67             .89           .90

Mathematics
  1A                    .76             .92           .97
  1B                    .78             .96           .99
  1                     .80             .96           .99
  2&3                   .67             .91           .92
  4&5                   .59             .91           .92
  6&7                   .61             .84           .89
  2,4,6&3,5,7           .76             .95           .95

Note. Sample sizes used for estimating the second-testing
      item parameters were 1,660 for Reading and 1,810
      for Mathematics.


Item  Parameter  Statistics 


The  means  and  standard  deviations  of  the  first-testing  item  parameters  are 
contained  in  Table  9.  Differences  between  booklets  in  these  means  and  standard 
deviations  can  be  the  result  of  context  and/or  sampling  effects.  For  the  item 
discriminations  there  were  systematic  context/sampling  effects,  particularly  for 
Mathematics. The item difficulties displayed substantial context/sampling
effects for Reading, but very small effects for Mathematics. For Reading there
were larger mean differences in item difficulties for the 3-parameter model than
for the 1-parameter model. Table 10 contains the RMSDs between the item
parameters estimated with the various first-testing booklets. For the item
difficulties the 1-parameter model produced smaller RMSDs than the 3-parameter
model.
Table 8

Correlations of First-Testing X or Y Item Parameters
with Second-Testing X or Y Item Parameters for
20 Reading and 22 Mathematics Items

First-Testing                              Difficulty
Booklets           Discrimination   3-Parameter   1-Parameter

Reading
 X Items
  1A                    .70             .95           .92
  1B                    .75             .79           .89
  1                     .79             .90           .92
  2                     .58             .90           .91
  4                     .60             .84           .85
  6                     .44             .83           .86
  2,4,6                 .70             .95           .96
 Y Items
  1A                    .85             .95           .94
  1B                    .69             .91           .91
  1                     .85             .95           .94
  3                     .66             .78           .86
  5                     .55             .78           .76
  7                     .17             .63           .58
  3,5,7                 .62             .85           .83

Mathematics
 X Items
  1A                    .72             .90           .98
  1B                    .65             .97           .99
  1                     .67             .96           .99
  2                     .67             .91           .93
  4                     .60             .93           .94
  6                     .67             .90           .94
  2,4,6                 .81             .96           .96
 Y Items
  1A                    .80             .96           .98
  1B                    .86             .95           .99
  1                     .89             .97           .99
  3                     .76             .91           .91
  5                     .60             .91           .90
  7                     .61             .77           .81
  3,5,7                 .78             .95           .93


Trait  Estimates 

Context effects. Using the second-testing data, trait estimates were
obtained for the X items (θ_X) and the Y items (θ_Y) for all examinees who
answered at least a third of the X and a third of the Y items and who did not
have zero or perfect scores. The item parameters on which the traits were based
were linked in the first testing, as previously described (see Table 5 for sample
Table 9

Means and Standard Deviations of First-Testing Item Parameters
for 40 X+Y Reading Items and 44 X+Y Mathematics Items

                                               Difficulty
                   Discrimination     3-Parameter      1-Parameter
Booklets            Mean     SD       Mean     SD      Mean     SD

Reading
  2&3                .98     .42      -.15     .93     -.50     .56
  4&5               1.11     .54       .43     .71     -.14     .56
  6&7                .94     .46      -.14    1.15     -.56     .83
  2,4,6&3,5,7        .97     .36       .03     .68     -.40     .55

Mathematics
  2&3                .83     .37      -.10     .86     -.43     .86
  4&5                .97     .50      -.10     .95     -.42    1.00
  6&7               1.03     .54      -.09     .88     -.36     .91
  2,4,6&3,5,7        .89     .36      -.08     .80     -.39     .87


sizes for item parameterization); θ estimates were based on 20 items for Reading
and 22 items for Mathematics. Because the item parameters were linked on the
basis of the first-testing data, θ_X and θ_Y theoretically were equated.
Table 11 presents the relationships between θ_X and θ_Y. For item parameters
estimated with Booklet 1A, the 1-parameter model produced higher correlations
between θ_X and θ_Y than the 3-parameter model. However, the 3-parameter model
produced closer equatings of means and standard deviations than the 1-parameter
model. The RMSD/S_θ between θ_X and θ_Y was greater for the 3-parameter than
for the 1-parameter model.


Table 10

Root Mean Squared Differences Between First-Testing Item Parameters
for 40 X+Y Reading Items and 44 X+Y Mathematics Items

                                                    Difficulty
              Discrimination:          3-Parameter:             1-Parameter:
                 Booklets                Booklets                 Booklets
Booklets   4&5   6&7   2,4,6&3,5,7   4&5   6&7   2,4,6&3,5,7   4&5   6&7   2,4,6&3,5,7

Reading
  2&3      .59   .52       .32       .90   1.01      .49       .58   .71       .32
  4&5            .59       .40             1.05      .54             .74       .37
  6&7                      .35                       .71                       .47

Mathematics
  2&3      .50   .45       .24       .39   .65       .28       .34   .52       .22
  4&5            .55       .33             .63       .30             .54       .25
  6&7                      .34                       .42                       .34


The results for Booklet 1A can be compared to those for Booklets 2 and 3 to
examine the extent of context effects. The change in context between the
parameter estimation and the trait estimation did not systematically affect the
correlations between θ_X and θ_Y. The change in context did decrease the
closeness of the equating of the means and standard deviations and increase the
RMSD/S_θ for Reading but had little effect on Mathematics.

For Booklets 2, 4, 6 and 3, 5, 7 the 1-parameter model produced higher
correlations between traits than the 3-parameter model. The means and standard
deviations were equated with about equal accuracy for the two models. The
difference between the models in correlations was reflected in the lower
RMSD/S_θ for the 1-parameter model.

Effects of Unequal Item Difficulties

In order to examine the equating of traits based on items of unequal
difficulty, the X and Y items were divided into sets of easier (E) and harder (H)
items.  This  division  was  based  on  the  proportions  of  examinees  who  passed  the 
items  in  the  second  testing.  The  distributions  of  item  difficulties  overlapped 
for  the  E  and  H  sets,  but  the  mean  difficulties  for  the  two  sets  differed  by 
about  as  much  as  is  common  for  adjacent  levels  of  standardized  achievement 
tests.  There  were  20  items  in  the  E  set  and  20  items  in  the  H  set  for  Reading 
and  22  items  in  each  of  the  two  sets  for  Mathematics.  The  item  parameters  for 
these  sets  were  those  obtained  in  the  first  testing  on  the  basis  of  the  pooled 
data  for  Booklets  2  to  7,  and  trait  estimates  were  based  on  second-testing  data. 

Table 12 presents the relationships between the traits based on the E and H
sets. The quality of the equating of the E and H sets can be compared to the
equatings of the X and Y sets in Table 11 for Booklets 2, 4, 6 and 3, 5, 7. The
correlations for θ_E and θ_H were very similar to the corresponding correlations
for θ_X and θ_Y. For Mathematics with the 3-parameter model, the θ_E and θ_H
equating was only slightly worse than the corresponding θ_X and θ_Y equating.
For Reading for both models and for Mathematics with the 1-parameter model, the
θ_E and θ_H equatings of means and standard deviations were noticeably poorer
than the θ_X and θ_Y equatings.

For Reading the 1-parameter model produced higher correlations and lower
RMSD/S_θ for θ_E and θ_H than the 3-parameter model, whereas the 3-parameter
model produced slightly better equatings of means and standard deviations. For
Mathematics the 3-parameter model produced closer equatings of means and
standard deviations, a slightly lower RMSD/S_θ, and an equal correlation of θ_E
and θ_H as compared with the 1-parameter model.

Effect  of  Trait  Level 


For all the second-testing trait equatings, the difference between the 1-
and 3-parameter models in terms of correlations and RMSD/Sg was essentially the
result of relatively large between-trait differences for low trait values
estimated by the 3-parameter model. To display this effect, the average trait
estimate for the X items and the Y items, (θ_X + θ_Y)/2, was found for each
examinee. The range of these values was divided into 20 cells. Examinees were
sorted into cells on the basis of their mean trait values, and within each cell
the RMSD/Sg between θ_X and θ_Y was found. Figure 1 contains plots of these
RMSD/Sg for traits based on item parameters estimated using Booklets 2, 4, 6 and
3, 5, 7. For low trait values, the RMSD/Sg was much greater for the 3-parameter
model than for the 1-parameter model; these trait values correspond to
number-correct scores below those expected by random guessing. The RMSD/Sg was
lower for the 3-parameter model than for the 1-parameter model for cells which
included about 70% of the examinees for Reading (Figure 1a) and about 60% of the
examinees for Mathematics (Figure 1b).

Figure 1

Second-Testing RMSD/Sg for θ_X vs. θ_Y, Based on Item
Parameters Estimated Using First-Testing Booklets 2, 4, 6 & 3, 5, 7

[Panels: (a) Reading; (b) Mathematics. Horizontal axis: (θ_X + θ_Y)/2.
Curves: 3-Parameter and 1-Parameter.]
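The cell-wise comparison described above can be sketched in a few lines of
Python. This is a minimal illustration rather than the study's own procedure:
the function name `rmsd_by_trait_cell`, the use of equal-width cells, and the
divisor (by default the overall standard deviation of the mean trait estimates,
standing in for the Sg term) are all assumptions.

```python
import numpy as np

def rmsd_by_trait_cell(theta_x, theta_y, n_cells=20, scale=None):
    """RMSD between two trait estimates within cells of their mean.

    Assumption: cells are equal-width intervals over the range of
    (theta_x + theta_y) / 2, and the RMSD in each cell is divided by
    `scale` (defaulting to the SD of the mean trait estimates).
    """
    theta_x = np.asarray(theta_x, dtype=float)
    theta_y = np.asarray(theta_y, dtype=float)
    mean_theta = (theta_x + theta_y) / 2.0
    if scale is None:
        scale = mean_theta.std(ddof=1)

    edges = np.linspace(mean_theta.min(), mean_theta.max(), n_cells + 1)
    # Use interior edges only, so the largest value falls in the last cell.
    cell = np.digitize(mean_theta, edges[1:-1])

    diff = theta_x - theta_y
    out = np.full(n_cells, np.nan)  # NaN marks an empty cell
    for k in range(n_cells):
        d = diff[cell == k]
        if d.size:
            out[k] = np.sqrt(np.mean(d ** 2)) / scale
    return edges, out
```

Plotting the returned values against the cell midpoints, once per model, would
give curves of the kind summarized in Figure 1.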

Observed versus expected proportion passing an item. Cross-validations of
the model predictions were made using chi-squares from the examinees' item
responses in the second testing; these item responses produced the observed
proportions passing the items (O_ij). The expected proportions passing the items
(E_ij) were found using the item parameters estimated in the first testing and
the traits estimated in the second testing that were based on the first-testing
item parameters and the second-testing item responses. (These trait estimates
produced the results in Table 11.) The chi-squares were obtained for the X items
and for the Y items. The means (taken over items) of these chi-squares appear
in Table 13.

Table 13

Mean Item Chi-squares for the X+Y
Items for the Second Testing

                                      Model
First-Testing Booklets     3-Parameter    1-Parameter

Reading (N=1,525)
  1A                            35             60
  2&3                           54             63
  2,4,6&3,5,7                   37             53

Mathematics (N=1,778)
  1A                            34             53
  2&3                           60             72
  2,4,6&3,5,7                   43             57

Because no item parameters were estimated from the data on which the
chi-squares were based, each item chi-square had 10 degrees of freedom for both
the 1- and 3-parameter models. Comparing the mean chi-squares for Booklets 1A
and 2 and 3, it can be seen that when there was a change in context from the
first to the second testing, chi-squares were higher than when the context was
constant from the first to the second testing. Pooling over contexts and
increasing sample sizes for the item parameter estimates (Booklets 2, 4, 6 and
3, 5, 7) decreased the chi-squares below the level found for Booklets 2 and 3,
but usually not below the level for Booklet 1A. The chi-squares for the
3-parameter model were lower than the chi-squares for the 1-parameter model. An
examination of the observed and predicted proportions of examinees passing the
items revealed that, on the average, the 3-parameter model made more accurate
predictions than the 1-parameter model.
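The cross-validation chi-square described above can be illustrated with a
short Python sketch. The exact statistic is not spelled out in the text, so the
details here are assumptions: examinees are sorted on their trait estimates
and split into 10 cells of roughly equal size (matching the 10 degrees of
freedom reported), a Pearson-type chi-square is summed over cells, and the
logistic item function uses the 1.7 scaling constant. The function names are
hypothetical.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item function (1.7 scaling assumed)."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def item_chi_square(responses, theta, a, b, c, n_cells=10):
    """Pearson-type fit statistic for one item.

    responses : 0/1 item scores from the second testing
    theta     : trait estimates for the same examinees
    a, b, c   : item parameters carried over from the first testing
                (for a 1-parameter model, use c = 0 and a common a)
    """
    responses = np.asarray(responses, dtype=float)
    theta = np.asarray(theta, dtype=float)
    order = np.argsort(theta)

    chi2 = 0.0
    for idx in np.array_split(order, n_cells):
        n = idx.size
        observed = responses[idx].mean()                 # O_ij
        expected = icc_3pl(theta[idx], a, b, c).mean()   # E_ij
        chi2 += n * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return chi2
```

Because the item parameters come from a different testing, no degrees of
freedom are lost to estimation, which is why the same reference value applies
to both models.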


Discussion  and  Conclusion 


Item  Parameters 


It  was  a  consistent  finding  that  X  +  Y  item  parameters  estimated  from  the 
same  booklet  were  more  highly  correlated  than  X  +  Y  item  parameters  estimated 
from  different  booklets.  There  are  several  factors  that  could  have  produced 
this  finding: 

1.  The  inclusion  of  extra  items  (Sets  A,  V,  or  W)  in  some 
of  the  booklets; 

2.  The  linking  of  parameters  from  different  booklets  by  the 
use  of  anchor  items; 

3.  Differences  in  the  sample  size  used  for  the  parameter
estimations;

4.  Interactions  between  the  ability  level  of  a  sample  and 
the  parameter  estimations; 

5.  Differences  in  the  number  of  items  scaled  together; 

6.  Systematic  differences  in  the  sequence  in  which  items 
appeared  in  different  booklets;  and 

7.  Unspecified  context  effects  other  than  sequence. 

The evidence for and against the importance of these factors is examined as
follows.

Inclusion of extra items (Sets A, V, or W) in some of the booklets. Booklets
4 to 7 contained anchor items (Set A items) that were not included in Booklet 1.
It is possible that these items altered the trait measured by the booklet. For
example, imagine that the Mathematics Set A items were all graph-reading items
and that no graph-reading items appeared in Sets X and Y. (This did not occur,
but it gives an extreme example of how the Set A items could alter the trait
being measured.) To all appearances, the Set A items did not seem to have
content systematically different from the X + Y items, but it is possible that
they were statistically different from the X + Y items. If the Set A items did
alter the trait being measured, then the item parameters could have been
affected. If this occurred, the correlations between item parameters estimated
in Booklets 4 and 5 and 6 and 7 (r_{4&5,6&7}) should have been higher than
r_{1A,4&5}, r_{1B,4&5}, r_{1A,6&7}, and r_{1B,6&7}. It would also be expected
that r_{4&5,6&7} would approximately equal r_{1A,1B}. An examination of Table 6
does not support these hypothesized relationships among the correlations.

Booklets 2 and 3 contained not only Set A items but also Sets V and W.
Thus, Booklets 2 and 3 differed more in content from Booklet 1 than did Booklets
4 and 5 and Booklets 6 and 7. If the inclusion of extraneous items caused item
parameters to change, then r_{1A,2&3} and r_{1B,2&3} should be lower than
r_{1A,4&5}, r_{1B,4&5}, r_{1A,6&7}, and r_{1B,6&7}. However, an examination of
Table 6 reveals that the correlations between items in Booklets 2 and 3 and
Booklets 1A or 1B tended to be higher than the correlations between Booklets 4
and 5 or 6 and 7 and Booklets 1A or 1B.

Thus,  it  does  not  appear  that  the  inclusion  of  extra  items  in  Booklets  2  to 
7  was  a  major  factor  in  reducing  the  correlations  between  item  parameters  for 
those  booklets  and  Booklet  1. 

The  linking  of  parameters  from  different  booklets  by  use  of  anchor  items. 
Results  indicate  that  the  linking  procedure  did  not  cause  the  reduction  in  item 
parameter  correlations  from  Booklets  1  to  Booklets  2  to  7.  It  should  be  noted 
that  the  procedure  used  here  for  linking  the  items  was  chosen  as  the  best  proce¬ 
dure  from  among  several  others:  linking  by  estimating  the  item  parameters  in 
matched  samples,  linking  by  using  the  first  principal  component  of  the  anchor 
item  difficulties,  and  linking  by  using  the  mean  anchor  item  difficulties  and 
discriminations. The procedure used here produced, in general, the highest
correlations among the item parameters and the best equatings of θ_X and θ_Y.

Differences in sample size used for the parameter estimations. For Reading
the sample sizes used for obtaining parameter estimates for Booklets 2 and 3, 4
and 5, and 6 and 7 were smaller than those for Booklets 1A and 1B. It is
possible that these sample sizes were sufficiently smaller to have caused the
item parameters to be noticeably less stable. Because Booklets 2 and 3 had the
highest sample sizes among Booklets 2 to 7, it would be expected that
r_{1A,2&3}, r_{1B,2&3}, and r_{1,2&3} would be higher than r_{1A,4&5},
r_{1B,4&5}, r_{1,4&5}, r_{1A,6&7}, r_{1B,6&7}, and r_{1,6&7}. An examination of
Table 6 verifies this pattern of correlations. The differences in sample sizes
would also suggest that r_{2&3,4&5} and r_{2&3,6&7} should be higher than
r_{4&5,6&7}; this pattern of correlations does not appear in Table 6.
Furthermore, it should be recalled that for Mathematics the sample sizes were
as large or slightly larger for Booklets 2 and 3, 4 and 5, and 6 and 7 than for
Booklets 1A and 1B; but the reduction in item parameter correlations observed
with a change in booklets appeared for Mathematics, as well as for Reading.

Interactions between ability level of a sample and the parameter
estimations. It is possible that the samples of examinees obtained for Booklets
2 to 7 were systematically different from the samples obtained for Booklet 1.
For example, severe floor or ceiling effects could affect the accuracy and
values of item parameter estimates. However, the distributions of abilities for
the first-testing booklets appeared quite similar. The mean proportion of items
passed for the various first-testing booklets ranged from .57 to .61 for Reading
and from .54 to .59 for Mathematics. These results argue against the sample
composition having had an important impact on the item parameter estimates.

Differences in the number of items scaled together. The same number of
items was calibrated in Booklet 1 and in Booklets 2 and 3, but fewer items were
calibrated in Booklets 4 and 5 and 6 and 7. Item parameters may have been
estimated more stably when more items were calibrated together. To examine this
hypothesis, the parameters for Booklets 2 and 3 were recalibrated excluding the
V and W items. In this recalibration, Booklets 2 and 3 had the same number of
items as Booklets 4 and 5 and 6 and 7. The resulting parameters for Booklets 2
and 3 were correlated with the second-testing Booklet 1 parameters, and these
correlations were compared with the corresponding correlations in Table 7. Only
two of the six correlations changed: for Reading the 3-parameter item
discrimination and difficulty correlations changed from .63 to .61 and from .76
to .77. Thus, it appeared that the number of items being calibrated did not
have an important effect on the stability of the item parameter estimates.

Systematic differences in the sequence in which items appeared in different
booklets. The sequence in which items appeared within a booklet could have had
an influence on item parameter correlations. If sequence is important, it would
be expected that item parameters obtained from booklets with similar item
sequences would be more similar than item parameters obtained from booklets with
dissimilar item sequences. Recall that Table 3 contains rank-order correlations
of item sequences between Booklet 1 and Booklets 2 to 7. These correlations
would lead to the expectation that for Reading, r_{1,2} would be greater than
r_{1,4}, which would be greater than r_{1,6}; also, r_{1,5} would be greater
than r_{1,3}, which would be greater than r_{1,7}. For Mathematics the sequences
of Set X items in Booklets 2, 4, and 6 had similar correlations with Booklet 1,
which would lead to the expectation that r_{1,2}, r_{1,4}, and r_{1,6} would be
similar; for the Set Y items, it would be expected that r_{1,5} would be greater
than r_{1,3}, which would be greater than r_{1,7}. The correlations in Table 8
are partially consistent with these expectations. In particular, the booklets
with the most similar item sequences tended to have more highly correlated item
parameters than the booklets with the least similar item sequences.

These  results  indicate  that  the  similarity  of  item  arrangements  might  have 
an  influence  on  the  similarity  of  item  parameters.  Part  of  this  influence  could 
be  the  result  of  examinee  fatigue  or  impatience  to  finish  the  test.  If  so, 
items  should  be  relatively  less  difficult  if  they  appear  at  the  beginning  of  a 
booklet  than  at  the  end  of  a  booklet.  In  Table  8  Reading  item  parameters  for 
Booklet  7  had  particularly  low  correlations  with  the  parameters  for  Booklet  1. 

A  passage  that  appeared  at  the  beginning  of  Booklet  1  and  near  the  end  of  Book¬ 
let  7  was  identified.  The  items  for  this  passage  were  all  relatively  more  dif¬ 
ficult  in  Booklet  7  than  in  Booklet  1.  These  items  also  had  relatively  higher 
discriminating  powers  in  Booklet  7  than  in  Booklet  1.  It  did  not  appear  that 
speededness  was  an  important  factor  because  93%  of  the  examinees  who  answered  at 
least  a  third  of  the  items  in  Booklet  7  answered  the  last  item  in  the  booklet. 

A possible explanation is that a significant number of the examinees who
answered questions about this passage near the end of Booklet 7 did not take the
care that examinees took when the passage was at the beginning of Booklet 1, and
that for Booklet 7 items for this passage were important in discriminating
between the higher scoring/more careful examinees and the lower scoring/less
careful examinees.

Several other analyses similar to the one described in the previous
paragraph were conducted. Items appearing at the end of booklets frequently, but
not always, were relatively more difficult than the same items appearing at the
beginning of another booklet. Results for item discriminations were not as
systematic. Thus, it appeared that the location of an item in a booklet could
have, but did not have to have, an impact on the item's parameters.

Unspecified  context  effects  other  than  sequence.  Although  the  location  of 
items  in  the  booklets  appeared  to  be  a  partial  explanation  of  context  effects  on 
item  parameters,  location  did  not  appear  to  be  a  complete  explanation.  It  is 
possible  that  other  factors  related  to  context  could  have  influenced  the  parame¬ 
ters.  For  example,  such  factors  might  be  specific  to  the  particular  content  of 
items.  It  is  not  apparent,  however,  exactly  what  these  factors  would  be. 

Conclusions.  After  an  examination  of  seven  factors  that  could  possibly 
have  influenced  the  stability  of  parameter  estimates,  the  conclusion  reached  is 
that  context  effects  are  not  artifacts  but  can  be  the  result  of  an  item's  loca¬ 
tion  in  a  booklet  and,  conceivably,  other  unexplained  context  effects.  It  may 
not  be  possible  to  obtain  truly  context-free  item  parameters.  However,  it  may 
be  possible  to  obtain  approximately  context-free  item  calibrations  by  basing  the 
item  calibrations  on  data  pooled  over  administrations  of  the  items  in  a  variety 
of  contexts. 

Traits 


Systematic  differences  in  item  parameter  estimates  can  be  important.  For 
example,  suppose  that  the  second-testing  trait  estimates  were  based  on  the  X  +  Y 
items  using  parameters  from  the  first  testing.  Because  the  mean  item  difficul¬ 
ties  varied  as  a  function  of  the  first-testing  booklet  and  sample  (see  Table  9), 
the means of the second-testing traits would also vary. Variations in the
estimated item discriminations would influence the standard errors of the traits
that would be predicted by the 3-parameter model.

Obtaining equated trait estimates is one of the most important tests of the
usefulness of the latent trait models. When item parameters were based on data
pooled over contexts (Booklets 2, 4, 6 and 3, 5, 7), second-testing trait
estimates based on items of approximately equal difficulty (θ_X and θ_Y) were
fairly well equated. Trait estimates based on items of systematically different
difficulty levels (θ_E and θ_H) were less well equated. It is encouraging that
well-equated traits were obtainable despite the presence of context effects on
the item parameters, but it is apparent that equating errors can be expected to
be greater for vertical than for horizontal equating.

One  potential  use  of  latent  trait  models  with  item  pools  is  in  basing  an 
examinee's  trait  estimate  on  a  subset  of  the  items  in  the  pool  and  predicting 
whether  the  examinee  would  have  passed  items  in  the  pool  he  or  she  did  not  take. 
This  provides  a  method  of  criterion  referencing  an  examinee's  trait  value.  If 
context  effects  influence  item  parameters,  such  criterion  referencing  will  be 
inaccurate.

Models


Recall that one of the criteria in the selection of items for this study
was fit of the items to the predictions of a 2-parameter logistic model (in
which c_i = 0 for all items). The V and W items were chosen from the items that
had relatively poor fit to that model; and the A, X, and Y items were chosen
from the items that had relatively good fit. In theory, an item that fits the
2-parameter model will fit the 3-parameter model but will not necessarily fit
the 1-parameter model. This theory implies that the item selection procedure
was biased in favor of the 3-parameter model. In practice, however, this bias
did not occur. Selection on the basis of fit to the 2-parameter model had the
effect of discarding items that had the poorest chi-squares for both the 1- and
3-parameter models.

The items that were retained for Sets A, X, and Y were among those that had
the best fit for both the 1- and 3-parameter models; but these items were not
those that systematically had the best fit for either one of the models. The
mean item chi-squares for the items that were discarded or placed in Sets V and
W were 10.5 (Reading) and 12.3 (Mathematics) for the 3-parameter model and 21.5
(Reading) and 24.7 (Mathematics) for the 1-parameter model. The mean item
chi-squares for the items chosen for Sets A, X, and Y were 7.1 (Reading) and 7.7
(Mathematics) for the 3-parameter model and 14.6 (Reading) and 13.8
(Mathematics) for the 1-parameter model. It is clear that the selection of the
A, X, and Y sets of items had a much greater effect on the mean of the
1-parameter chi-squares than on that of the 3-parameter chi-squares. It is also
clear that the items chosen for Sets A, X, and Y fit the 3-parameter model much
better than the 1-parameter model. Even if the items had been chosen on the
basis of having the best fit with respect to the 1-parameter model, the chosen
items would have had a higher mean chi-square for the 1-parameter than for the
3-parameter model.

For small sample sizes (N ≈ 200), the 3-parameter model produced less
stable item difficulties than the 1-parameter model. For larger sample sizes, the
two  models'  difficulties  were  essentially  equally  stable.  This  result  argues 
for  the  use  of  the  1-parameter  (Rasch)  model  for  small  sample  sizes.  However, 
the  trait  equatings  based  on  item  parameters  estimated  with  small  sample  sizes 
were  frequently  so  poor  that  it  does  not  appear  prudent  to  use  either  model  with 
small  sample  sizes. 

The two models differed in the types of errors they displayed in the
equating of traits. The 3-parameter model tended to produce more unsystematic or
random error than the 1-parameter model for low trait values (i.e., trait values
associated with number-correct scores below those expected by random guessing).
The 1-parameter model tended to produce greater systematic errors in trait
equatings than the 3-parameter model, as exhibited in the quality of the
equating of means and standard deviations.³

When the predictions of the latent-trait models were evaluated in a type of
cross-validation (see the mean chi-squares in Table 13), the 3-parameter model
produced more accurate predictions, on the average, than the 1-parameter model.
This result argues for the use of the 3-parameter model rather than the
1-parameter Rasch model, particularly for multiple-choice tests with few answer
choices.

³ For an explanation of these results, see Lord (1980), in this volume.

The application of the technology of computer-driven adaptive testing
requires the development of large banks of test items. Each bank may contain 250
to 400 items, and all must measure the same ability on the same metric or scale.
It is unreasonable and impracticable to assemble a single group of 2,000
subjects for 250 to 400 minutes in order to obtain data on all the items;
therefore, a method for linking together subsets of items administered to
varying groups must be investigated. Item characteristic curve (ICC) theory
offers a unique method of linking subsets of test items due to the invariance
property of the ICC parameters. This invariance property rests on the two major
theoretical assumptions of latent trait theory: (1) unidimensionality and (2)
local independence.

Unidimensionality means that only a single ability is being measured and is
assumed to be the property of an item pool, even when assembled into subsets.
Local independence means that testees' responses to an item are independent of
their responses to another item. More simply stated, this means that an item
response is a function of ability and no other factor. In effect, this is a
restatement of the unidimensionality assumption. If an item pool is
unidimensional, then any shift in score metric that is due to a linear
transformation may be corrected or adjusted by application of the proper
complementary linear transformation. This is what is meant by the idea that
latent trait parameters are invariant to a linear transformation, and it is this
theoretical property that allows item pools to be linked and transformed to a
common metric.

In  previous  research  efforts,  item  pools  have  been  linked  by  the  method  of 
linear  equating  (see  Lord,  1977;  Ree,  1977;  Sympson  &  Ree,  in  press)  with  appar¬ 
ent  success.  To  date  there  has  been  little  research  on  the  efficacy  of  these 
linking  procedures  and  the  effects  of  errors  in  ICC  parameter  estimation  on 
their  (linearly)  transformed  values. 

ICC  Parameters 


The three-parameter logistic model of Birnbaum (1968) is the most frequently
used for relating item responses to testee ability. The three parameters (a, b,
and c) are item discrimination, item difficulty (or location), and probability
of chance success (or lower asymptote), respectively.

The curve described by these parameters takes the shape of a (cumulative
frequency) ogive or an "S," with the upper asymptote approaching a probability
of  1.0  and,  usually,  with  a  lower  asymptote  of  a  probability  greater  than  0.0. 
The  ogive  describes  the  probability  of  obtaining  a  correct  answer  to  an  item  as 
a  monotonic  increasing  function  of  ability. 

The item discrimination parameter (a) is a function of the slope of the ICC
and generally ranges from .5 to about 2.5. A value of a equal to about 1.0 is
typical of many test items; a values below .5 are insufficiently discriminating
for most testing purposes, and a values above 2.0 are infrequently found.

The item difficulty parameter (b) describes the point of inflection of the
ICC and is usually scaled between -2.5 and +2.5, although the metric is
arbitrary. The item guessing parameter (c) is the lower asymptote of the ICC and
is generally interpreted as the probability of selecting the correct item option
by chance alone. Most test items have c parameters greater than 0.0 and less
than or equal to .30.

Figure 1 shows three ICCs. The horizontal axis is scaled in units of
ability, θ, and the vertical axis is the probability of answering the item
correctly [P(1|θ)]. The solid curved line shows an ICC for an item of average
difficulty with acceptable discrimination and the lower asymptote appropriate
for a five-option multiple-choice item. The dashed line shows an item of
identical difficulty, with a c value of .28, but a lower a value. Note how the
slope of the curve is less steep. The third curve, the dot-dash line, shows an
item with a c value of .30, an a parameter of 1.0, and a b parameter equal to
1.0. As the b parameter changes, the location of the inflection point of the
curve is displaced along the horizontal axis.

Figure 1

Item Characteristic Curves

Equation  1  presents  the  mathematical  function  describing  the  curve. 

\[
P(\theta)_j = c_j + (1 - c_j)\left[1 + e^{-1.7 a_j (\theta - b_j)}\right]^{-1} \qquad [1]
\]

Previous research (Ree, 1978) indicates that the ICC parameters may be estimated
with some reasonable degree of accuracy, provided that a sufficient sample of
examinees with an appropriate distribution of ability (θ) is available.
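As a small illustration of Equation 1, the following Python sketch evaluates
the three-parameter logistic ICC at a few ability levels. The function name
`icc_3pl` is an assumption made for the example; the parameter values are
those given in the text for the dot-dash curve of Figure 1 (a = 1.0, b = 1.0,
c = .30).

```python
import math

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under Equation 1."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# Dot-dash curve described for Figure 1: a = 1.0, b = 1.0, c = .30.
for theta in (-3.0, 0.0, 1.0, 3.0):
    p = icc_3pl(theta, a=1.0, b=1.0, c=0.30)
    print(f"theta = {theta:+.1f}   P = {p:.3f}")
```

At θ = b the exponent is zero, so the probability is midway between the lower
asymptote and 1.0, that is, (1 + c)/2; far below b it approaches c, and far
above b it approaches 1.0.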

Linking  Paradigms 

Two  fundamental  linking  procedures  may  be  defined  and  are  known  as  the 
Anchor  Items  Method  (AIM)  and  the  Anchor  Subjects  Method  (ASM).  In  AIM  every 
subset  of  items  is  administered  to  a  different  sample  of  subjects,  but  embedded 
into  the  group  of  items  to  be  analyzed  is  a  common  (or  anchor)  set  of  items. 
During  analysis  the  anchor  items  are  identified,  and  the  following  linear  trans¬ 
formation  is  applied  to  the  resultant  ICC  parameters: 


where b_t is the item location parameter transformed to the desired scale and
s_{b1} and s_{b2} are the standard deviations of the desired scale and observed
scale, respectively. A similar procedure for the a parameter is defined by


\[
a_t = a_2 \, \frac{s_{b2}}{s_{b1}} \qquad [3]
\]

where
a_t is the item discrimination parameter transformed to the desired scale;
a_2 is the observed a parameter; and
s_{b1} and s_{b2} are as in Equation 2.


Because  the  c  parameter  is  measured  on  the  probability  axis,  it  does  not  change 
and  no  transformation  need  be  applied. 
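The anchor-item transformation can be sketched in Python as follows. This is
an illustration under stated assumptions, not a transcription of the study's
equations: it assumes the location transformation takes the usual mean-and-sigma
form, in which anchor difficulties on the observed scale are shifted and
rescaled to match the mean and standard deviation of the same anchors on the
desired scale, and that discriminations are divided by the same scale factor.
The function name `link_to_anchor_scale` and its arguments are hypothetical.

```python
import numpy as np

def link_to_anchor_scale(a_obs, b_obs, anchor_idx, b_anchor_desired):
    """Place observed-scale a and b estimates on the desired scale.

    a_obs, b_obs      : estimates for all items on the observed scale
    anchor_idx        : positions of the anchor items within those arrays
    b_anchor_desired  : difficulties of the same anchors on the desired scale
    """
    a_obs = np.asarray(a_obs, dtype=float)
    b_obs = np.asarray(b_obs, dtype=float)
    b2 = b_obs[np.asarray(anchor_idx)]               # anchors, observed scale
    b1 = np.asarray(b_anchor_desired, dtype=float)   # anchors, desired scale

    slope = b1.std(ddof=1) / b2.std(ddof=1)          # s_b1 / s_b2
    b_t = slope * (b_obs - b2.mean()) + b1.mean()    # assumed mean-sigma form
    a_t = a_obs / slope                              # a_2 * s_b2 / s_b1
    return a_t, b_t                                  # c is left unchanged
```

Under this form the product a(θ - b) is the same on both scales, so the
probabilities given by the ICC are unaffected by the change of metric.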


The ASM requires that the same group of subjects be available to take each
subset of items. It is extremely unlikely that the same 2,000 subjects could be
assembled to take items over a long period of time, as would be required to
place tests on the same metric from year to year. For this reason the ASM seems
less likely to find long-term practical application. Because of its potential
for use, the AIM procedure is the subject of the present study.

Method 


In order to have a known standard for reference, a simulation study was run
using 2 groups of simulated testees, a single set of 20 anchor items, and 2
differing groups of 60 experimental, or nonanchor, items. These two groups of
items were assembled into two tests. Both groups of simulees, designated S1 and
S2, were specified to have about the same normal distribution of θ. Table 1
shows the mean, standard deviation, and minimum and maximum of θ for Groups S1
and S2. These two groups represent what might be expected if subjects for
experimental testing were chosen from a larger pool, such as candidates for
military enlistment. Response vectors for these simulees were generated on the
two tests.

Table 1

Mean, Standard Deviation, and
Minimum and Maximum of θ for
Groups S1 and S2

                    Group
Statistic        S1        S2

Mean            -.014      .025
SD               .998     1.004
Minimum        -2.600    -2.600
Maximum         2.600     2.600

Generation  of  Item  Responses 

In order to generate a vector of item responses for each simulee, the θ
values were used in Equation 1 to compute the likelihood of correctly answering
each item.

Because Equation 1 yields a number P(θ)_j such that 0.0 < P(θ)_j < 1.0, a
number X_j was drawn from a uniform (rectangular) distribution ranging from 0.0
to 1.0 and compared to P(θ)_j. If X_j was larger than P(θ)_j, then an incorrect
response was specified for the item; otherwise, a correct response was
specified. Thus, a simulee with P(θ)_j = .90 would answer an item correctly 9 in
10 times, and a vector of item responses was developed for each simulee in each
data set. These response vectors were then used to investigate the AIM linking
procedures.
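The response-generation rule described above can be written as a brief Python
sketch. The function name `simulate_responses` and the use of NumPy's random
generator are assumptions made for illustration; the rule itself (an incorrect
response only when the uniform draw exceeds the model probability) follows the
description in the text.

```python
import numpy as np

def simulate_responses(theta, a, b, c, rng):
    """Return a simulees-by-items matrix of 0/1 responses."""
    theta = np.asarray(theta, dtype=float)[:, None]        # one row per simulee
    a, b, c = (np.asarray(v, dtype=float)[None, :] for v in (a, b, c))

    # Equation 1: probability of a correct response for every pair.
    p = c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

    x = rng.random(p.shape)          # uniform draws between 0.0 and 1.0
    return (x <= p).astype(int)      # incorrect only when X exceeds P

# Hypothetical usage: 2,000 simulees and two items. Clipping the abilities to
# +/-2.6 is an assumption suggested by the limits reported in Table 1.
rng = np.random.default_rng(1)
theta = np.clip(rng.normal(size=2000), -2.6, 2.6)
responses = simulate_responses(theta, [1.0, 1.0], [-1.5, 1.5], [0.2, 0.2], rng)
```

Each row of `responses` is one simulee's response vector, ready to be passed
to a parameter estimation program.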

Table 2 shows the distribution of ICC parameters for the 80 items for Test
1 (T1) and Test 2 (T2), and Table 3 shows the ICC parameters for the 20 anchor
items common to both tests.


Table 2

Mean and Standard Deviation of the
Generated Item Parameters for
Tests 1 and 2

Item Parameter
and Statistic       Test 1    Test 2

a
  Mean              1.056     1.045
  SD                 .279      .239
b
  Mean               .085     -.056
  SD                 .844      .858
c
  Mean               .188      .202
  SD                 .054      .047

Simulees from Group S1 were administered only the items in Test 1, and
simulees from Group S2 only the items in Test 2. In order to study the effects
of sample size, the ICC parameters were estimated on 4 samples, drawn with
replacement, of 250, 500, 1,000, and 2,000 simulees. The ICC parameters were
estimated on these 4 sample sizes for both groups. Anchor ICC parameter values
from the 4 samples administered Test 1 served as the input values for the anchor
item parameters to the second test. This permitted the 4 sizes of the
calibration sample (250, 500, 1,000, 2,000) to be varied and to be applied in
the 4 samples used to estimate the anchor item ICC parameters.

Results 


Table 4 shows the intercorrelations between the known item parameters and
the estimated parameters. As past research indicates (Urry, 1976), correlations
increased with increasing sample size. The correlations in Test 1 for b and
estimates of b started high, at .952, and increased to an exceptionally high
.992. Correlations for a and estimates of a began moderately, at .666, and
climbed to .869; but the correlations of c and estimated c increased from only
.031 to .115. In Test 2 much the same pattern was observed except that the
correlation of c and estimated c increased from .164 to .315 as sample size
increased.

Because correlations are insensitive to constant differences, as might be
found if ICC parameters were either over- or underestimated by a constant
amount, summed absolute deviations of the estimated parameters from the known
parameters were computed for each parameter in each sample size. Table 5
presents the summed absolute deviations (or summed errors) for both tests with the
four sample sizes. Figure 2 displays this graphically.
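
The reason for reporting both indices can be illustrated with a small sketch. The values below are hypothetical and chosen only to show the point: when every estimate is too high by the same amount, the correlation with the known values is perfect, yet the summed absolute deviation is not zero.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def summed_abs_dev(estimated, known):
    """Summed absolute deviation of estimates from known values."""
    return sum(abs(e - k) for e, k in zip(estimated, known))


known = [0.8, 1.0, 1.2, 1.4]          # hypothetical known a values
estimated = [k + 0.3 for k in known]  # every estimate too high by .3

r = pearson_r(known, estimated)              # unaffected by the constant shift
total = summed_abs_dev(estimated, known)     # sensitive to the constant shift
mean_error = total / len(known)              # average (per item) deviation
```

In this constructed case the correlation is 1.0 while the average absolute deviation is .3, which is why the two indices are read together.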

There was a large drop in summed error when the a parameter was estimated
on progressively larger samples, up to and including the step from 500 to
1,000 simulees. Between 1,000 and 2,000 simulees the difference in summed
error was smaller. The relationship between error and sample size for the b
parameter was more nearly constant. That is, the line on the figure for
estimates of b is generally straight, which means error tended to be reduced in
direct proportion to the number of simulees. The almost flat line for the c
parameter indicates that virtually no reduction of error occurred with increasing
sample size for that parameter. The average absolute deviation for the c
parameter was almost one-third of the entire range of the parameter, as the c
parameter is generally estimated between .00 and .30. However, past research (Ree,
1979) indicates that even for low-ability subjects, the effects of errors in the
estimation of the c parameter are small.


Table 3

ICC Item Parameters of the 20
Anchor Items Common to Both Tests

Anchor Item        ICC Item Parameter
Number             a         b         c

   1              .80     -1.50       .10
   2              .80     -1.35       .10
   3             1.00     -1.20       .15
   4             1.00     -1.05       .15
   5             1.10      -.90       .20
   6             1.20      -.75       .20
   7             1.20      -.60       .22
   8             1.20      -.45       .20
   9             1.30      -.30       .20
  10             1.40      -.15       .20
  11             1.40       .15       .22
  12             1.30       .30       .25
  13             1.20       .45       .20
  14             1.20       .60       .22
  15             1.10       .75       .22
  16             1.00       .90       .20
  17             1.00      1.05       .25
  18              .80      1.25       .25
  19              .80      1.35       .25
  20              .80      1.50       .25

Mean             1.06       .00       .20
SD                .21       .95       .04

Table 4

Intercorrelations between Known and
Estimated ICC Item Parameters for
Tests T1 and T2 for Both Groups
with Varying Sample Sizes

Item Parameter
and Sample Size     Test 1     Test 2

a
    250              .666       .512
    500              .671       .725
   1000              .831       .813
   2000              .869       .886
b
    250              .952       .929
    500              .964       .962
   1000              .980       .979
   2000              .992       .987
c
    250              .031       .164
    500              .035       .109
   1000             -.012       .331
   2000              .115       .315


Table 5

Summed Absolute Deviations (Σ|Error|) and Average Absolute
Deviations (Mean |Error|) for the Three ICC Item Parameters for
Tests 1 and 2, for Both Groups with Varying Item Sample Sizes

Item Parameter             Test 1                      Test 2
and Sample Size     Σ|Error|   Mean |Error|     Σ|Error|   Mean |Error|

a
    250              30.645       .383           30.529       .382
    500              22.809       .285           20.691       .259
   1000              15.749       .197           16.891       .211
   2000              15.598       .195           15.139       .189
b
    250              23.505       .294           20.847       .261
    500              19.860       .248           16.607       .208
   1000              17.689       .221           13.805       .173
   2000              12.735       .159           11.513       .144
c
    250               7.736       .097            7.235       .090
    500               7.360       .092            7.512       .094
   1000               6.908       .086            7.318       .092
   2000               6.440       .081            6.864       .086


Figure 2

Errors in Estimation of ICC Parameters


Summed deviations of known ICC parameters from the equated value of the ICC
parameters were computed for the a and b parameters for the 16 combinations of
calibration sample size and equating sample size. Table 6 shows the summed
absolute deviations and the per item deviation for both parameters for the 16
combinations. The equated a parameter showed large summed deviations whenever the
sample was limited to 250 simulees, whether in the calibration or the equating
sample. The lowest error rates for the a parameter occurred when the anchor
item values were estimated on 2,000 simulees. The effects of the size of the
calibration sample were not as clear. When 2,000 subjects were used to estimate
the anchor item ICC parameters, the magnitude of the error was approximately the
same for all calibration sample sizes except 250.

With increasing calibration sample size the error rate increased by some
small amount, as indicated by the average (per item) absolute deviation error.
This is an unexpected result; an explanation may be found in the relationship
between the sets of estimated a parameters. If the estimated a parameters were
all estimates of the same value and if the test scale were unidimensional (a
basic assumption of the theory), then the estimated a parameters should be
linear transformations of one another and should be correlated 1.0, as correlations
are invariant to a linear transformation. This was not found to be the case,
and Table 7 shows the intercorrelation of the estimated a parameters. Only the
correlation between the estimate of a calculated on 1,000 simulees and the
estimate of a calculated on 2,000 simulees approached this relationship. This lack
of linearity may be due to the assumption of normality and to the rescaling used
in the calibration procedure; these may interact in such a way as to produce the
anomalous results.

Table 7 also shows the intercorrelations of estimated b parameters. All
exceeded .90, and the summed deviations also showed a steady decrease as sample
size increased for the b parameter, indicating a virtually linear transformation
of estimated b parameters from sample to sample. However, with 500 simulees in
the equating sample, a similar anomaly was observed, which may also be due to
the assumption of normality and to rescaling.
Table 6

Summed Absolute Deviations (Σ|Error|) and Average Absolute
Deviations (Mean |Error|) for Item Parameters a and b for
Various Calibration and Equating Sample Sizes

     Sample Size                          Item Parameter
                                     a                          b
Calibration   Equating     Σ|Error|   Mean |Error|    Σ|Error|   Mean |Error|

    250         2000        34.226       .428          23.368       .292
    500         2000        15.128       .189          21.934       .274
   1000         2000        15.987       .120          16.366       .205
   2000         2000        16.596       .207          13.458       .168

    250         1000        38.363       .480          25.644       .321
    500         1000        17.679       .221          24.341       .304
   1000         1000        19.587       .245          19.116       .239
   2000         1000        21.032       .263          16.883       .211

    250          500        48.611       .608          25.437       .318
    500          500        24.558       .307          22.899       .286
   1000          500        28.829       .360          18.187       .227
   2000          500        31.209       .390          15.833       .198

    250          250        44.312       .554          26.201       .328
    500          250        21.577       .270          24.416       .305
   1000          250        24.439       .312          19.484       .244
   2000          250        27.024       .338          17.326       .217

Table 7

Intercorrelations, Means, and Standard Deviations of the
Estimated a Parameters (Lower Triangle) and b Parameters
(Upper Triangle) for Test 2

Sample                  Sample Size                       b
Size          250      500     1000     2000       Mean      SD

 250           --     .952     .940     .935       .056     .856
 500         .757       --     .978     .969       .059     .838
1000         .690     .860       --     .986       .074     .870
2000         .595     .803     .926       --       .056     .873

a
  Mean      1.353    1.254    1.235    1.227
  SD         .484     .335     .325     .306

Discussion 

The  results  of  the  study  present  new  evidence  of  the  critical  interrela¬ 
tionship  between  item  calibration  and  equating  sample  sizes  and  the  values  of 
ICC  parameters. 


Estimating  and  Equating  a 


For  the  16  combinations  of  calibration  sample  sizes  and  equating  sample 
sizes  identified  in  Table  6,  the  least  deviation  of  estimated  a  from  its  known 
value  occurred  with  an  equating  sample  size  of  2,000  and  a  calibration  sample 
size  of  500.  As  mentioned  in  the  previous  section,  although  the  least  error 
between  an  estimated  and  known  a  value  was  expected  with  a  match  of  2,000  equat¬ 
ing  and  2,000  calibrating  sample  sizes,  error  actually  increased  very  slightly 
with  increasing  calibration  sample  sizes  beyond  500.  This  discrepancy  appar¬ 
ently  resulted  from  a  nonlinear  transformation  with  sample  sizes  of  250  and  500 
but  tended  toward  linearity  with  sample  sizes  of  1,000  and  2,000. 

During  equating  procedures  a  sample  size  of  greater  than  500  should  be  used 
to  ensure  an  acceptable  degree  of  confidence  that  the  estimation  of  a  does  not 
significantly  depart  from  its  "true"  value.  In  the  same  light,  estimation  of  a 
suffers  considerably  using  equating  sample  sizes  of  less  than  500  such  that 
equating  samples  of  1,000  or  2,000  are  highly  desirable  to  minimize  error  in 
estimating  a. 

Estimating  and  Equating  b 

Table 6 also shows the linear relationship between error and sample size
for the b parameter. The b parameter was best estimated with calibration and
equating samples of 2,000 each, although a calibration sample size of 1,000 with
an equating sample size of 500 can be tolerated without an appreciable increase
in error. With all combinations of calibration and equating sample sizes, b was
estimated quite well.

Estimating and Equating c

The flat line drawn in Figure 2 representing the data from Table 5 shows
the estimation of the c parameter to be nearly insensitive to increases in
sample size. As sample size increased from 250 to 2,000 subjects, error decreased,
but only very slightly. With c defined as the lower asymptote of the ICC and
representing the probability of extremely low-ability examinees correctly
answering an item, the inability to estimate c with precision could be disturbing.
However, it has been pointed out (Lord, 1975) that if a(θ - b) < -2, then the
probability of a correct response is essentially c. Therefore, if there are a
large number of subjects with ability θ so that θ < -(2/a - b), that is,
θ < b - 2/a, c can be accurately estimated. If this requirement is not met, c
will be poorly estimated.
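
The condition can be made concrete with a short worked sketch. It assumes a three-parameter logistic ICC with scaling constant D = 1.7 and a standard normal ability distribution; both are assumptions made for illustration, and the function names are hypothetical. The two items use the parameter values of the easiest and hardest anchor items in Table 3.

```python
import math
from statistics import NormalDist


def p_correct(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC (assumed form, D = 1.7)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def theta_cutoff(a, b):
    """Ability below which a(theta - b) < -2, i.e. theta < b - 2/a."""
    return b - 2.0 / a


ability = NormalDist(0.0, 1.0)  # assumed ability distribution

# (a, b, c) for anchor items 1 and 20 in Table 3.
for a, b, c in [(0.80, -1.50, 0.10), (0.80, 1.50, 0.25)]:
    cut = theta_cutoff(a, b)
    share_below = ability.cdf(cut)        # proportion of simulees below the cutoff
    p_at_cut = p_correct(cut, a, b, c)    # already close to the asymptote c
    print(f"a={a:.2f} b={b:+.2f} c={c:.2f} "
          f"cutoff={cut:+.2f} share below={share_below:.5f} "
          f"P at cutoff={p_at_cut:.3f}")
```

Under these assumptions the cutoff for an easy item lies far below the bulk of the ability distribution, so very few simulees fall in the region that carries information about c, whereas a hard item has many more simulees below its cutoff.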

Conclusions 


A stable and accurate estimate of the a and b parameters requires large
numbers of subjects over a broad range of ability. The estimation of c requires
large numbers of subjects at very low ability levels. This holds for both
equating and calibrating samples; therefore, it is necessary to administer test
items, whether to be calibrated or equated, to the largest samples available.
Discussion: Session 5


Gail  Ironson 

Bowling  Green  State  University 


Two  key  questions  seem  to  arise  from  the  three  papers  on  item  linking  and 
equating.  Can  information  be  obtained  from  various  combinations  of  adminis¬ 
tering  subsets  of  items  to  samples  of  people  in  order  (1)  to  put  the  items'  pa¬ 
rameters  on  a  single  scale  and  (2)  to  put  all  examinees'  trait  estimates  on  a 
common  scale?  To  the  extent  that  these  two  questions  can  be  answered  affirma¬ 
tively,  different  forms  of  a  test  can  be  given  to  different  examinees.  For  the 
first  question,  dealing  with  the  invariance  of  the  parameter  estimates,  the  pa¬ 
pers  considered  sample  size,  context  effects,  and  the  choice  of  latent  trait 
model.  For  the  second,  which  deals  with  equating  the  traits,  there  can  be  a 
consideration  of  whether  latent  trait  theory  improves  equating  over  classical 
models  (and  in  what  circumstances),  a  consideration  of  the  problems  in  vertical 
versus  horizontal  equating,  and  a  comparison  of  the  efficacy  of  the  models. 

These  two  questions  will  be  discussed  interchangeably. 

At  various  points  in  the  studies  presented,  an  anchor  test  design  was  used. 
Two  different  groups  of  people  each  get  a  basically  different  set  of  items,  but 
some  of  the  items  are  the  same.  The  items  commonly  given  to  both  groups  are 
called  "anchor  items." 

The  question  of  parameter  invariance  will  be  considered  first.  Theoreti¬ 
cally,  the  item  characteristic  curve  (ICC)  parameters  estimated  using  different 
samples  of  subjects  should  be  invariant  within  a  linear  transformation.  Meas¬ 
ures  of  the  accuracy  of  the  invariance  of  parameters  are  typically  the  correla¬ 
tion  coefficient  and  some  type  of  error  function.  In  the  Ree  and  Jensen  study, 
an  anchor  subset  of  items  was  embedded  in  two  otherwise  different  tests  adminis¬ 
tered  to  two  different  samples  of  simulated  examinees.  Both  groups  of  simulated 
subjects  were  specified  to  have  about  the  same  normal  distribution  of  ability. 
Since  it  was  a  simulation  study,  the  true  ICC  parameters  were  used.  Ree  and 
Jensen  looked  at  both  the  correlation  coefficients  and  the  summed  absolute  devi¬ 
ations  of  estimated  parameters  from  the  true  parameters. 

There are two overall trends that have been noted previously in the
literature. The first is the relative accuracy in estimating the a, b, and c
parameters. The second concerns the sample size, and here there are some
consistencies and some inconsistencies. In looking at the correlations, it is found that
the correlations for the b parameters run in the mid .90s, even with a small
sample size of about 250. The correlations for the a parameters range from the mid
.60s to the mid .80s, and it seems that the effect of increasing the sample size
is most potent for this parameter. (Unfortunately, the exact function for the a
parameter has not yet been determined. Is it a log function, a linear function,
or something else?) And the c parameter is not estimated well at any sample
size, the correlations ranging from .03 to .31.

With respect to the summed absolute deviations, the error in estimating a
is largest, but it seems to decrease more rapidly than for the other parameters.
The error for b is less, and it decreases more slowly. And, as mentioned above,
even with an increase in sample size, there is no reduction in error for the c
parameter. Although the c parameter has the smallest error, the error is about
one-third of the range, so that the relative error is quite large.

In  Yen's  study  the  correlations  between  the  parameters  for  various  sample 
sizes  can  also  be  examined.  With  a  sample  size  of  200,  different  samples  took 
the  test  in  the  same  context;  in  other  words,  it  was  a  sample  that  was  just 
split into two. These were Yen's 1a and 1b samples. (This can be compared to a
sample size of 250 in the study by Ree and Jensen.) The a parameters correlated
.63 and .78, which is approximately the same level of correlation observed
previously. The b parameters correlated .87 and .94 for two tests (Reading and
Math), and the Rasch difficulty parameters correlated higher, .95 and .98. Her
study  also  showed  that  the  change  in  the  context  substantially  decreased  the 
stability  of  the  parameters:  The  correlations  dropped.  This  was  particularly 
true  for  the  £  parameters.  As  an  example  of  a  change  in  context,  she  also  noted 
that  the  items  at  the  end  of  the  booklet  seemed  to  be  relatively  more  difficult 
than  when  they  appeared  at  the  beginning  of  the  booklet. 

In  comparing  the  1-parameter  and  the  3-parameter  models,  she  found  that  the 
Rasch  model  produced  slightly  more  stable  difficulties.  But  it  is  not  clear, 
according  to  Yen,  whether  the  parameters  of  one  of  the  models  were  affected  more 
by  changes  in  context  than  the  parameters  of  the  other  model. 

In  comparing  the  joint  effect  of  the  sample  size  and  the  effect  of  context, 
she  noted  that  increasing  the  sample  size  increased  the  correlations  between 
parameters  when  the  context  was  held  constant.  However,  when  pooling  over  dif¬ 
ferent  contexts  to  increase  the  sample  size,  lower  correlations  were  obtained 
compared  to  the  correlations  with  smaller  samples  in  the  same  context.  For  ex¬ 
ample,  correlations  based  on  N  =  600  with  items  in  different  contexts  were  lower 
than  correlations  based  on  an  N  =  400  in  the  same  context.  Thus,  as  the  sample 
size increased, the correlations increased, but this did not compensate for a
change  in  context.  Yen  summarized  this  by  stating  that  the  X  and  Y  item  parame¬ 
ters  estimated  from  different  samples  were  more  highly  correlated  than  X  and  Y 
item  parameters  estimated  from  different  booklets  (i.e.,  different  contexts). 
Finally,  she  presented  an  excellent  discussion  of  possible  reasons  for  the  in¬ 
variance,  eliminating  several  reasons. 

It  is  possible  that  context  effects  may  also  be  present  in  the  study  by 
Marco,  Petersen,  and  Stewart.  They  found,  for  instance,  that  equating  may  be 
different  depending  on  whether  the  anchor  test  is  internal  or  external,  with 
more  error  if  the  anchor  test  is  external.  Though  there  are  many  reasons  why 
this  could  be  true,  one  of  them  might  be  the  change  in  context.  Since  they  did 
not  use  the  same  items  for  the  different  internal  and  external  tests,  that  may 
partially  explain  the  finding.  However,  context  effects  could  be  investigated 
in  that  regard. 
Ree and Jensen used calibration sample sizes of 250, 500, 1,000, and 2,000 and
equating sample sizes at those four levels; altogether they had 16 combinations
of calibration sample size and equating sample size. They looked at the summed
deviation of the known ICC parameters from the equated values. For the various
calibration sample sizes, the equated b parameter error decreased most when
going from 500 to 1,000. For the equated a parameter, the error decreased from
calibration sample sizes of 250 to 500 and then increased surprisingly as the
sample size went up. Although I do not have the answer to why this happened, I
think it is a question that needs to be answered. That same table may be used
to look at the equating sample size as well as the calibration size. The
equated a parameter showed smaller deviations as the sample size increased except
when the equating sample size was between 250 and 500, where it increased. The
equated b parameter had less error with the sample of 500 than with either 250
or 1,000. Thus, there are several anomalies in that same table. The others
could be seen by rearranging the table, switching the calibration and equating
sample size.

Another  observation  is  that  it  seems  to  be  more  important  to  keep  the  cali¬ 
bration  sample  size  above  250  in  estimating  the  a  parameter.  Errors  were  larger 
with  small  calibration  samples  and  large  equating  samples  than  with  large  cali¬ 
bration  and  small  equating  samples  when  sample  sizes  were  between  250  and  500. 
Finally,  in  looking  at  the  total  combined  effect,  it  seems  that  if  calibrating 
and  equating  samples  of  250  and  250  are  compared  to  calibrating  and  equating 
samples  of  2,000  and  2,000,  there  is  roughly  a  50%  decrease  in  the  error. 
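
The "roughly 50%" figure can be checked directly against the average absolute deviations reported in Table 6. The following is a small arithmetic sketch; the helper name is hypothetical.

```python
def percent_decrease(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Average absolute deviations from Table 6
# (calibration/equating sample sizes of 250/250 versus 2,000/2,000).
a_small, a_large = 0.554, 0.207
b_small, b_large = 0.328, 0.168

a_drop = percent_decrease(a_small, a_large)
b_drop = percent_decrease(b_small, b_large)
print(f"a: {a_drop:.1f}% decrease, b: {b_drop:.1f}% decrease")
```

With these table values the reduction is about 63% for the a parameter and about 49% for the b parameter.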

The second question was concerned with a comparison of the models. In
Yen's study the 1- and 3-parameter models were used, but in some of the subsets
of items, one of the criteria for selection was fit to the predictions of a
2-parameter logistic model (in which c = 0). However, it should be noted that
there were certainly items that were selected that did not have zero c's. Also,
in two of the subsets of items she allowed items that did not fit well. For the
items selected on the basis of fit to the 2-parameter model, she noted that this
interestingly had the effect of discarding items that had the worst chi-squares
for both the 1- and 3-parameter models. The items retained (Sets A, X and Y),
however, fit the 3-parameter model better than the 1-parameter model.

For  small  sample  sizes,  an  N  of  200,  the  Rasch  model  generally  had  slightly 
more  stable  difficulty  estimates  than  the  3-parameter  model.  For  large  samples 
the  difficulties  were  stable  for  both  models.  However,  trait  equatings  based  on 
item  parameters  estimated  with  small  sample  sizes  were  so  poor  that  she  recom¬ 
mended  neither  model  be  used. 

The  two  models  differed  in  terms  of  the  types  of  errors  in  equating  of  the 
traits.  The  3-parameter  model  tended  to  produce  more  unsystematic  or  random 
error  than  the  Rasch  model  for  low  trait  values.  The  Rasch  model  had  greater 
systematic  errors  than  the  3-parameter  model.  The  Rasch  model  also  had  higher 
correlations between traits and lower root mean square differences, but the
3-parameter model generally had closer equatings of means and standard
deviations. Yen noted that this seemed to be essentially the result of relatively
large between-trait differences for low trait values estimated by the
3-parameter model. Finally, the 3-parameter model made better predictions in
cross-validation.
In the ETS study the 1-parameter model seemed to be superior to the
3-parameter ICC model in equating a test to itself. However, Marco, Petersen, and
Stewart noted that there may be a natural bias operating here because the a's
and c's were fixed. The 3-parameter model was the best equating model when
total tests of unequal difficulty were equated through a medium difficulty
anchor test with dissimilar samples.

The  question  might  be  asked,  which  method  seems  to  hold  more  promise  for 
some  of  the  problems  that  are  investigated  in  measurement,  for  instance,  hori¬ 
zontal  versus  vertical  equating?  In  an  ideal  horizontal  equating  study  the  test 
forms  would  be  at  comparable  difficulty  levels;  there  might  be  minor  unintended 
differences  in  ability  level  of  the  samples;  and  the  anchor  test  would  be  rough¬ 
ly  parallel  to  the  whole  test.  Under  these  conditions  the  conventional  methods 
seem to work. In fact, some, but not all, of these conditions are necessary.
For example, to generalize from Petersen's study, when a test is equated to a
test  like  itself  through  a  parallel  anchor  test,  then  the  linear  model  yields 
good  results  even  if  different  samples  are  used.  If  samples  of  different  abili¬ 
ty  are  taken  and  there  are  test  forms  at  comparable  difficulties,  and  if  the 
anchor  test  is  different  in  difficulty  from  the  total  test,  then  the  ICC  methods 
would  be  best — the  1-parameter  slightly  better  than  the  3-parameter  model. 

In  the  typical  situations  in  vertical  equating  there  would  be  two  test 
forms  that  would  be  different  in  difficulty,  and  groups  of  examinees  who  would 
normally  differ  in  ability  level.  This  is,  of  course,  a  more  difficult  problem. 
The  results  from  Yen's  paper  suggest  that  equating  errors  would  be  greater  for 
vertical  than  for  horizontal  equating  because  the  trait  estimates  based  on  items 
of  systematically  different  difficulty  were  less  well  equated.  However,  she  had 
the  same  ability  level  for  that  particular  result,  whereas  in  some  vertical 
equating  situations  there  would  be  different  ability  levels. 

The  results  from  the  Marco,  Petersen,  and  Stewart  study  suggest  that  the 
1-parameter  model  would  not  handle  vertical  equating  very  well.  The  study  found 
that  when  total  tests  differ  in  difficulty,  the  1-parameter  model  gave  unaccept¬ 
able results in many instances. This is also consistent with the findings of
Slinde  and  Linn  (1977).  It  seems  that  the  3-parameter  model  holds  more  promise 
for  vertical  equating. 

In  summarizing  some  of  the  evidence  on  trait  equating  from  Yen,  it  should 
be  noted,  as  she  pointed  out,  that  obtaining  equated  trait  estimates  is  one  of 
the  most  important  tests  of  the  usefulness  of  latent  trait  models.  The  first 
finding  is  that  trait  estimates  based  on  items  of  approximately  equal  difficulty 
were  fairly  well  equated.  As  previously  mentioned,  the  Rasch  model  had  higher 
correlations  between  trait  estimates  and  lower  root  mean  square  differences, 
though  the  3-parameter  model  was  better  for  equating  the  means  and  standard  de¬ 
viations.  The  second  conclusion  that  Yen  came  to  was  that  trait  estimates  based 
on  items  of  systematically  different  difficulty  (i.e.,  an  easy  versus  a  hard 
test)  were  less  well  equated.  Third,  she  found  that  traits  can  be  equated  well 
despite  the  presence  of  context  effects  on  item  parameters  if  there  are  large 
samples. 

The ETS study presented by Marco, Petersen, and Stewart examined the
adequacy of five types of score-equating models when certain sample and test
characteristics were systematically varied. The equating models were linear models,
equipercentile models, and ICC models. The samples were either random or
dissimilar. They looked at several variations on the test characteristics, paying
particular attention to the relationship of the anchoring items to the whole
test. The study was basically divided into two parts. The first part was
equating a test to itself. The second part was equating a test to a different
test. For the first part Marco, Petersen, and Stewart looked at anchor tests
that were either internal or external and at anchor tests that were either more
difficult or easier than the whole test. In the second part, equating a test to
another test through an internal anchor test, they examined the effects of the
difficulty of two total tests and the similarity of the equating samples. They
had two different methods of obtaining the criterion scores. The extent to
which that might have influenced the results is not entirely clear; however,
they are fully cognizant of that problem and did say that the findings were
tentative.

The  results  of  the  investigation  are  displayed  clearly  in  a  series  of  fig¬ 
ures  in  the  paper.  In  equating  a  test  to  itself,  they  found  that  for  medium 
difficulty  anchor  tests  of  similar  content  to  the  total  test,  the  best  linear 
model had the smallest total error, followed by the 1-parameter and 3-parameter
models.  They  also  found  that  if  the  anchor  test  is  almost  parallel  to  the  total 
tests,  the  difference  between  samples  is  not  so  important,  i.e.,  whether  it  is  a 
similar  or  dissimilar  sample.  Another  finding  was  that  there  was  less  error,  in 
general,  with  internal  anchor  tests  than  with  external  anchor  tests.  The  second 
part  of  the  exploration  of  equating  a  test  to  itself  examined  easy  and  difficult 
anchor  tests.  The  effect  of  similarity  of  the  samples  is  potent;  having  similar 
samples  is  more  important  if  the  anchor  test  difficulty  is  off-center,  compared 
to  the  total  test  difficulty.  The  second  point  is  that  linear  models  are  still 
best  if  the  samples  are  similar;  however,  ICC  models  are  superior  when  the 
anchor  test  is  different  in  difficulty  from  the  total  test  and  the  samples  dif¬ 
fer  in  ability. 

For  equating  a  test  to  a  different  test  (of  different  difficulty)  using  an 
internal  anchor  of  similar  content  and  medium  difficulty,  they  found  that  the 
smallest  error  was  present  for  the  3-parameter  model;  that  was  followed  by  the 
equipercentile  method.  There  was  little  effect  of  the  sample. 

Several questions have been raised by these studies. First is the question
why the increase in calibration sizes changed the accuracy of the a parameter
during equating in the Ree and Jensen study. Could this possibly have something
to do with the Urry program? I really do not know.

In addition to asking how stable the parameter estimates are and how well
traits can be estimated on a common scale, we ought to start looking at the
characteristics of those items whose parameters are not stably estimated and
where on the scale things are not working properly and why. The c parameter
seems to be particularly difficult to estimate; we might do better if we
"stuffed the ends." For instance, it might be found that the c parameters are not
stably estimated when there are not enough low-ability examinees. It might also
be found that if a rectangular distribution of ability is used, more stable
estimates of the a parameters are obtained and even more stable estimates of the b
parameters might be obtained for those items that are off-center. It might be
asked where on the scale things are not working properly for equating of traits,
too. For example, Yen found that the 3-parameter model tended to produce more
unsystematic or random error than the Rasch model for low trait values, which is
a step in the right direction.

The  next  question  is  one  raised  by  Marco,  Petersen,  and  Stewart:  To  what 
degree  are  their  results  influenced  by  the  criterion  equating  procedure?  They 
also  noted  that  in  some  of  the  cases  the  rank  order  remained  the  same,  which 
would,  of  course,  give  a  little  more  confidence  in  terms  of  the  generalizability 
of  the  results. 

We have looked at equating under various conditions, for example, the
characteristics of the test, the anchor items and their relation to the test,
and the sample characteristics. Of course, we could continue getting every
possible combination and permutation of these, and this would at least keep us
busy until the next adaptive testing conference. One combination that Marco,
Petersen, and Stewart mentioned was the effect of internal versus external
anchors when equating tests of different difficulty. We might also look at the
length of the anchor test and see how that affects equating.

What conclusions are to be drawn? The first conclusion is that under optimum
conditions, everything works well. Of course, it helps to have a few thousand
people at one's disposal. Second is a finding that has been repeated over and
over again: The b parameter is estimated the best, then the a parameter, and
the c parameter is estimated rather poorly. With small samples it was found
that the Rasch difficulty parameter was more stably estimated by the Rasch
model than by the 3-parameter model. A third conclusion is that changes in
context have substantial effects on item parameters, but if there is a large
sample size for estimating parameters, trait equatings usually are good.

Fourth, as we already know, vertical equating seems to be more of a problem
than horizontal equating. Fifth, using various combinations of the whole test
characteristics, the anchor test characteristics, and the samples, the
conditions under which the various equating procedures work best have been
described; there is now a source to consult to find out the best procedure
under given circumstances.

As I mentioned before, the papers in this session shed light on two major
questions of interest: To what extent are the parameters invariant? and To
what extent can traits be estimated by different items and be put on a common
scale? Obviously, we still have a long way to go, but I think we are moving in
the right direction.

A  Model  for  Incorporating  Response-Time  Data  in  Scoring 
Achievement  Tests 


Kikumi  Tatsuoka  and  Maurice  Tatsuoka 
University  of  Illinois 


The study by Tatsuoka and Birenbaum (1979) raised an important issue with
respect to adaptive diagnostic testing and computer-managed routing by which
each examinee is sent to his/her level of instruction: that it is necessary to
consider an alternative scoring procedure in which individual differences in
information-processing skills are taken into account along with individual
ability or achievement levels.

In Tatsuoka and Birenbaum's study a computerized diagnostic adaptive test
for a series of pre-algebra signed number lessons was given to eighth graders at
a junior high school, and a computer-managed routing system sent each examinee
to the instructional unit corresponding to the level of skill that he/she
reached in the initial test. The adaptive test for signed numbers consisted of
12 groups of items representing 12 different skills. The instructional units of
computerized lessons teaching the same 12 skills were rearranged into the same
order as the skills in the adaptive test, so that if an examinee stopped at the
7th skill level, he/she was sent to the 7th level of the lessons. After the
student went through the 7th to 12th instructional units, a 52-item conventional
computerized posttest was administered.

Factor analysis revealed that the test scores of the posttest did not satisfy
the assumption of local independence, i.e., unidimensionality. A closer
investigation was performed by a cluster analysis of the 92 examinees'
response patterns on the basis of Euclidean distances between pairs of response
vectors. This analysis identified a group of students whose response patterns
were significantly different from the others. Their scores on the items prior
to the stopping level of the initial diagnostic test were higher than most
scores of other students, but their scores on the subsequent items were as low
as the poorest students' scores. It was confirmed with their teachers that most
of them were actually "A" students. It was also confirmed that the members of
this group were taught signed number addition operations by a teaching method
different from that of the subsequent instructional units, which teach
subtraction operations. The procedures of information processing associated
with these two instructional methods of performing arithmetic upon signed
numbers are greatly different. The traditional scoring procedure of latent
trait theory would not be capable of detecting these discrepancies associated
with different information processes for arriving at the answers to a given
item.


A study by Tatsuoka and Tatsuoka (1978) indicated one useful approach toward
the goal just mentioned. It showed that under certain general conditions,
item response time scores very closely follow Weibull distributions, a
3-parameter family extensively used in system reliability theory (see, e.g.,
Mann, Schafer, & Singpurwalla, 1974). The most interesting of the three
parameters is the shape parameter, whose magnitude determines the nature of the
conditional response rate, that is, the conditional probability that an
examinee who has not responded to an item up to time t will respond to it
within an infinitesimally short time interval thereafter. A brief note on the
mathematical and conceptual background of the Weibull distribution, introduced
in the study of Tatsuoka and Tatsuoka (1978), is given in the following
section.

As a follow-up to the Tatsuoka and Birenbaum (1979) study, Weibull
distributions were fitted to every item in the posttest. The Weibull fit of
almost all items (14 items on addition that were taught prior to the students'
exposure to the PLATO lessons) was quite poor when the fitting was done for the
total sample. However, the separate fits in two groups, which had earlier been
identified as having distinctly different instructional backgrounds, were very
good for all 14 items (see Appendix Tables A-1, A-2, and A-3). Further, it was
found that the value of the shape parameter c differed considerably in the two
groups for each item, being higher in one group for some items and lower for
others. That is, there was a Task × Instructional Method interaction effect on
the shape parameter c.

The foregoing suggests that the Weibull shape parameter can assist in the
identification of items that are sensitive to particular information-processing
skills. After identifying and constructing such discriminating items, it was
anticipated that an index known as person conditional response rate, to be
developed below, could be used for postdicting the instructional background of
students and routing them accordingly.

Rationale  of  Weibull  Distributions 


Measuring the time needed to achieve a given goal (that is, response time)
is easy in computer-managed testing; but since it is impossible to collect
accurate response time data in paper-and-pencil testing, it has not been
utilized thus far in the realm of practical application of psychometrics.
Tatsuoka and Tatsuoka (1978) have studied the statistical aspects of response
time distributions and their characteristics as associated with test items.

There  are  a  number  of  theoretical  distributions  by  which  the  response  time 
data  may  seem  to  be  fitted  well,  so  it  is  necessary  to  follow  some  guidelines  as 
to  what  sort  of  distribution  might  be  appropriate  to  represent  a  set  of  response 
times  for  a  given  item.  Rasch  (1960)  used  the  2-parameter  gamma  distribution  as 
a  model  for  the  time  taken  to  read  a  passage  of  N  words  and  the  Poisson  process 
as  a  guide  to  his  model.  The  occurrence  of  a  response  is  a  random  event,  and 
all  the  random  events  were  assumed  to  be  of  the  same  kind.  Rasch  was  interested 
in  their  total  number. 




Application  to  Ability  Testing 

Sato (1975) and others introduced the Weibull distribution, which has been
used extensively in the context of system reliability theory. Reliability
theory is the study of the probability of failure, within a given time span, of
a mechanical or electronic system as a function of the probabilities of failure
of individual components of the system. The justification for utilizing a
distribution from such an alien field is that the test item is identified with
the system whose longevity is being assessed. The student's attacks on the item
correspond to the shocks or wear and tear to which the system is subjected, and
the eventual solution of the item is the failure of the system. It is plausible
to imagine the student to be intent on cracking the system by answering the item
correctly. The time he/she takes in doing so, the response time, corresponds to
the "survival time" of the system. This rationale for the applicability of
Weibull distributions for item response time does not lead to a derivation of
the distribution or the density function. Mann et al. (1974) and others have
said that the distribution was empirically discovered, rather than deductively
derived. Later, a logical basis was postulated as an ex post facto
rationalization, and it added greatly to the credibility of the distribution in
the theory of system reliability. This is the concept of hazard rates, which is
essentially the conditional probability that a system that has survived through
time t will fail during an infinitesimal time interval immediately after that.

Conditional  Response  Rate 


A similar concept, conditional response rate (CRR), was introduced in this
study as a logical basis for use of the Weibull distribution. Suppose f(t) is
the probability density that a person randomly selected from the population will
respond to a given item during the interval [t, t + dt]. Then, the proportion
of individuals who will have responded to the item by time t is the probability
distribution function

$$F(t) = \int_{0}^{t} f(u)\,du.$$

The proportion of individuals who have not responded to the item by time t is
1 - F(t). Consequently, the conditional probability density that a person will
respond to the item during the interval [t, t + dt], given that he or she has
not responded to the item up to time t, is given by f(t)/[1 - F(t)].

By assuming CRR as a function of time t to be monotonically increasing, or
decreasing, as a power function of t, the Weibull density and distribution
functions can be expressed as follows:

$$f(t) = \frac{c}{u_0^{\,c}}\,(t - t_0)^{c-1}\exp\!\left[-\left(\frac{t - t_0}{u_0}\right)^{c}\right] \ \text{for } t \ge t_0, \qquad f(t) = 0 \ \text{otherwise} \qquad [1]$$

and

$$F(t) = 1 - \exp\!\left[-\left(\frac{t - t_0}{u_0}\right)^{c}\right] \qquad [2]$$

where

c (> 0) is the shape parameter,
u_0 (> 0) is the scale parameter, and
t_0 (> 0) is the location parameter.

Figure 1 shows several Weibull distributions.

Figure 1

Weibull Density Functions with t_0 = 2,
u_0 = 15, and Four Values of c.

f(t) = (c / u_0^c) (t - t_0)^(c-1) x exp[-((t - t_0) / u_0)^c]


If c = 1, then f(t) is a negative exponential density function. If c is
less than 1, then f(t) is a monotonically decreasing function. The Weibull
density function is approximately symmetric when c is about 3.6. Figure 2 shows
CRR functions obtained from live data. CRR 1 in Figure 2 is the CRR when c was
larger than 1; the decreasing dot graph (CRR 2) was obtained from the
distribution when c was less than 1. When c = 1, CRR becomes a straight line
(CRR 3) that is parallel to the time axis.
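The three cases just described can be checked numerically. The following is a minimal sketch, not the authors' program: it codes Equations 1 and 2 directly and evaluates the CRR as f(t)/[1 - F(t)]. The function names and the particular shape values (0.7, 1.0, 2.0) are illustrative assumptions; the scale 15 and location 2 are borrowed from the Figure 1 caption.

```python
import math


def weibull_pdf(t, c, u, t0=0.0):
    """Weibull density (Equation 1): shape c, scale u, location t0."""
    if t <= t0:
        return 0.0  # the density is taken as zero at or before the location
    z = (t - t0) / u
    return (c / u) * z ** (c - 1) * math.exp(-(z ** c))


def weibull_cdf(t, c, u, t0=0.0):
    """Weibull distribution function (Equation 2)."""
    if t <= t0:
        return 0.0
    return 1.0 - math.exp(-(((t - t0) / u) ** c))


def crr(t, c, u, t0=0.0):
    """Conditional response rate f(t) / [1 - F(t)]."""
    return weibull_pdf(t, c, u, t0) / (1.0 - weibull_cdf(t, c, u, t0))


# Illustrative time points (seconds) and shape values; these are assumptions.
TIMES = (5.0, 10.0, 20.0, 40.0)
for shape in (0.7, 1.0, 2.0):
    rates = [crr(t, shape, u=15.0, t0=2.0) for t in TIMES]
    print(shape, [round(r, 4) for r in rates])
```

With a shape below 1 the printed rates fall over time, with a shape of 1 they stay at 1/u, and with a shape above 1 they rise, which is the pattern attributed to CRR 2, CRR 3, and CRR 1, respectively.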

Goodness-of-Fit Tests


Figures 3 and 4 show the displays of goodness-of-fit tests with the normal
and Weibull distributions. The step function represents the cumulative
distribution of a set of response times to a matrix multiplication. The
continuous line stands for the estimated theoretical distribution function. The
Weibull distribution fits the data better than does the normal distribution.
About 700 cases of the goodness-of-fit test were carried out, and most data
fitted either the Weibull or 3-parameter gamma distributions.

Figure 2

Three Types of Conditional Response Rate Function
F'(t)/[1 - F(t)], F(t) = Weibull Distribution Function
(Horizontal axis: time in seconds)
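The comparison of a step function with a fitted curve can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated rather than taken from the study, the Weibull curve uses the generating parameters instead of estimates, the normal curve uses the sample mean and standard deviation, and the largest vertical gap between the empirical and theoretical distribution functions is used as the fit index (the source does not say which statistic its displays were based on).

```python
import math
import random
import statistics


def weibull_cdf(t, c, u, t0=0.0):
    """Weibull distribution function with shape c, scale u, location t0."""
    return 0.0 if t <= t0 else 1.0 - math.exp(-(((t - t0) / u) ** c))


def normal_cdf(t, mean, sd):
    """Normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf((t - mean) / (sd * math.sqrt(2.0))))


def max_cdf_gap(sample, cdf):
    """Largest vertical gap between the empirical step function and cdf."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for k, x in enumerate(xs):
        fx = cdf(x)
        # Check the step function just before and just after its jump at x.
        gap = max(gap, abs((k + 1) / n - fx), abs(k / n - fx))
    return gap


# Simulated response times; every numeric value here is hypothetical.
rng = random.Random(1979)
C_TRUE, U_TRUE, T0_TRUE = 1.2, 120.0, 3.0
sample = [T0_TRUE + rng.weibullvariate(U_TRUE, C_TRUE) for _ in range(1000)]

mean, sd = statistics.mean(sample), statistics.stdev(sample)
gap_weibull = max_cdf_gap(sample, lambda t: weibull_cdf(t, C_TRUE, U_TRUE, T0_TRUE))
gap_normal = max_cdf_gap(sample, lambda t: normal_cdf(t, mean, sd))
print("Weibull gap:", round(gap_weibull, 3), "normal gap:", round(gap_normal, 3))
```

Because skewed response times cannot be negative while a fitted normal curve assigns noticeable probability below the smallest observation, the normal gap is expected to be the larger of the two for data of this kind.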

Theoretical distributions were fitted to the observed response time
distribution of each item in two ways: (1) for the subgroup of students who
answered the item correctly (OK subgroup) and (2) for the subgroup of students
who answered the item incorrectly (NO subgroup). The OK subgroup and the NO
subgroup had considerably different estimated Weibull parameters, but both
showed very good fits for most items. Figure 5 shows the estimated Weibull
distributions of the OK subgroup and the NO subgroup for an item in the pretest
that required matrix multiplication.

The Weibull parameter c of the OK subgroup in a 48-item matrix algebra
pretest correlated .32 with the number of options in the item and .41 with the
difficulty indices. The items with more choice options tended to have large c
values. If one accepts the interpretation that the item c value reflects the
degree of engagement students show when the item is answered correctly
(Tatsuoka & Tatsuoka, 1978), it may be concluded that within the range
represented, the larger the number of options, the greater the engagement
students feel. This seems reasonable, since items with more options present
more of a cognitive task and, hence, probably induce greater involvement on the
part of the students. About 10 items in the test, asking about mathematical
properties of orthogonal transformations, eigenvalues, and eigenvectors, were
very difficult for many students in the course. These items tended to have the
smaller Weibull shape parameters c in both OK and NO subgroups. A similar
observation was obtained from the 64-item signed number pretest.


Figure 3

Goodness-of-Fit Test for the Time Data and Weibull
Distribution Function for Question 17

t_0 = 0.7554, max. corr. = 0.9813, c = 1.266, tau = 429.9
(x axis: response times, 3 to 840)

The 3-parameter gamma distributions fit well the items that repeatedly
required a simple mechanical task, whereas the Weibull distributions fit well
the items that required a higher cognitive task to respond to. Since the CRR of
the gamma distributions is always nondecreasing, that is, either monotonically
increasing or parallel to the time axis (see Appendix B), the interpretation of
the Weibull shape parameter (see Figure 2) provides wider applicability than the
gamma shape parameter does. Moreover, the parameter estimation routine by
maximum likelihood usually failed to give convergent estimated gamma parameters
when items had decreasing CRRs.

Latent Response Time Model

Latent  Response  Time  Variable  and  Item  Response  Time  Characteristic  Curve 

As a first step toward developing the person conditional response rate
(PCRR), the existence of a latent response time variable τ, analogous to the
ability variable θ in latent trait theory, is postulated. Thus, given a set of
n items, the performance on which is affected by θ, it is assumed that there
also exists a variable τ affecting the time taken by an examinee to answer each
of these items. There will be no attempt to give any precise psychological
meaning to this construct beyond saying that it may be regarded as a pervasive
trait of individuals to be slow or quick in solving items of a certain domain.

Figure 4

Goodness-of-Fit Test for the Response Time Data and
Normal Distribution for Question 17

mean = 117.2, standard deviation = 128.3
(x axis: response times, 3 to 840)

The plausibility of this postulation is suggested by the following empirical
findings. In the Tatsuoka and Tatsuoka (1978) study, the performance scores
on a 48-item matrix algebra test were found to have a strong tendency toward
unidimensionality. At the same time, the response times for these items showed
a suggestion of unidimensionality by the scree test. On the other hand, the
posttest for the signed number lessons mentioned earlier showed no semblance of
unidimensionality in the total sample. However, when one instructional
background group identified by cluster analysis (hereafter called Group 2) was
removed, both performance scores and response times came somewhat closer to
being unidimensional in the remaining sample.

On the strength of these observations and of the fact, mentioned earlier,
that the Weibull distribution fits the response time data for most items, a
model for item response time is developed in the following manner, roughly
paralleling latent trait theory.

Figure 5

Weibull Distribution and Density Function for Question 16
(Vertical Scale for f(t) Is Magnified)

OK subgroup only: c = 1.161, t_0 = 5.4740, u_0 = 38.47, mean = 36.51

Let

$$d_{ig} = t_{ig} - \bar{t}_{i} \qquad [3]$$

be the deviation of individual i's response time t_ig for item g from his or
her mean response time over the given set of n items. Now τ_i may be
conceptualized as the expectation E(t_ig) of that person's response times over
an infinite number of items of the same type as those in the set of n. Then,
τ_i + d_ig is approximately equal to t_ig, so if τ_i varies across the
population, it is reasonable to assume that τ_i + d_ig follows a Weibull
distribution just as t_ig does. Therefore,

$$F_{d_g}(\tau) = 1 - \exp\!\left[-\left(\frac{\tau + d_g}{u_g}\right)^{c_g}\right] \qquad [4]$$

is defined as the response time characteristic function (RTCF) for item g,
where t_0 is set equal to 0 in the general expression (Equation 2) for the
Weibull distribution function to simplify the task of parameter estimation.
This is interpreted to represent the probability that a person whose latent
response time is τ will arrive at the answer to item g at or prior to time
τ + d_g.

For estimating the two item parameters c_g and u_g as well as the person
parameter τ_i, the density function corresponding to Equation 4 is written in
accordance with Equation 1 (with t_0 set equal to 0), for each person i, and the
product over all items and those individuals who got the item correct is formed
to obtain the likelihood function. That is,

$$L = \prod_{g=1}^{n}\prod_{i=1}^{N_g} f(t_{ig}) = \prod_{g=1}^{n}\prod_{i=1}^{N_g} \frac{c_g}{u_g^{\,c_g}}\,(\tau_i + d_{ig})^{c_g - 1}\exp\!\left[-\left(\frac{\tau_i + d_{ig}}{u_g}\right)^{c_g}\right] \qquad [5]$$

where N_g is the number of subjects in the OK subgroup for item g.
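To make the structure of this likelihood concrete, the sketch below evaluates its logarithm for given parameter values. It is an illustration under assumptions rather than the authors' estimation routine: the function names, the list-of-lists data layout, and the small example data are hypothetical, and no maximization is attempted. Each person's deviation d_ig is computed from that person's mean over the n items, and only correct responses (the OK subgroup) enter the sum.

```python
import math


def weibull_logpdf(t, c, u):
    """Log of the Weibull density with location 0, shape c, scale u."""
    z = t / u
    return math.log(c / u) + (c - 1.0) * math.log(z) - z ** c


def log_likelihood(times, correct, c, u, tau):
    """Log of the product over items and correct responders.

    times[i][g]   response time of person i on item g
    correct[i][g] True if person i answered item g correctly
    c[g], u[g]    shape and scale of item g
    tau[i]        latent response time of person i
    """
    n_items = len(c)
    total = 0.0
    for i, row in enumerate(times):
        mean_i = sum(row) / n_items
        for g in range(n_items):
            if not correct[i][g]:
                continue  # only the OK subgroup contributes
            arg = tau[i] + (row[g] - mean_i)  # tau_i + d_ig
            if arg <= 0.0:
                return float("-inf")  # outside the support of the density
            total += weibull_logpdf(arg, c[g], u[g])
    return total


# Hypothetical data: two persons, three items.
times = [[4.0, 9.0, 14.0], [6.0, 6.0, 12.0]]
correct = [[True, True, False], [True, False, True]]
shapes, scales = [1.2, 0.9, 1.5], [8.0, 10.0, 15.0]
person_means = [sum(row) / len(row) for row in times]
print(log_likelihood(times, correct, shapes, scales, person_means))
```

When each τ_i is set to the person's observed mean, τ_i + d_ig reduces to the observed time t_ig, which gives a convenient check on an implementation.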

Before going to the next step of developing the PCRR function G(t), note
how τ itself, once estimated, can help in the task of postdicting a student's
instructional background. Suppose there are two items that differentiate
between two prior instructions, A and B, by actually showing a reversal in the
magnitude order of mean times required for their solution by examinees who were
previously taught by these two methods. Table 1 shows the mean response time
(also with the estimates of Weibull parameter c and CRR) of 14 items described
earlier for the two groups, the prior instructional methods of which were A and
B, respectively.


Table 1

Means of Response Time, Observed Conditional Response Rate
at Mean, and Weibull Shape Parameters of Addition Problems
of 64-Item Signed Number Test

                                                        Weibull
         Mean Response Time      CRR at Mean        Shape Parameter
Item     Others    Group 2     Others   Group 2     Others   Group 2
  3       9.84      13.14       0.13      0.07       1.45      0.80
  4       7.61       5.13       0.13      0.20       0.99      1.01
 14      14.99      18.48       0.10      0.08       1.78      1.68
 17       6.39       8.60       0.19      0.11       1.35      0.86
 18       8.35       9.00       0.17      0.10       1.68      0.90
 19       7.16       8.44       0.18      0.10       1.47      0.75
 28      11.78      11.89       0.10      0.13       1.25      1.96
 31       8.70      10.85       0.12      0.09       1.00      0.91
 32       9.65       5.43       0.09      0.23       0.84      1.43
 33       4.62       7.22       0.28      0.17       1.49      1.46
 42      14.28      14.85       0.10      0.07       1.76      1.16
 46      10.08       9.83       0.10      0.10       0.93      1.03
 47       6.22      11.55       0.17      0.07       1.05      0.63
 56      10.56       9.69       0.14      0.14       1.87      1.64

Let the means of Items 32 and 33 be taken as an example (as Items 1 and 2,
respectively); then

$$\bar{t}_{1A} = 9.65 \text{ sec.}, \qquad \bar{t}_{1B} = 5.43 \text{ sec.}$$

$$\bar{t}_{2A} = 4.62 \text{ sec.}, \qquad \bar{t}_{2B} = 7.22 \text{ sec.}$$

Given these data and the observed response times t_1 and t_2 for the two
items of a person about whom there is no other information, a natural but
simple-minded decision rule for postdicting his/her instructional background
would be to choose A if t_1 > t_2, and B otherwise. The problem, of course, is
that the magnitude order of the two observed times could be reversed from the
"true" order by errors of measurement. Knowledge of the person's τ_i may help
increase confidence in the postdiction, using the sequential decision rule
shown in Figure 6, again a deliberately simple-minded one.
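The reversal criterion and the simple-minded rule can be sketched in a few lines. The mean response times below are copied from Table 1, with "Others" playing the role of method A and Group 2 the role of method B; the function names are hypothetical, and treating a tie between the two observed times as B follows from the wording "B otherwise."

```python
# Item number -> (mean response time for Others, mean response time for Group 2),
# copied from Table 1.
MEAN_RT = {
    3: (9.84, 13.14), 4: (7.61, 5.13), 14: (14.99, 18.48), 17: (6.39, 8.60),
    18: (8.35, 9.00), 19: (7.16, 8.44), 28: (11.78, 11.89), 31: (8.70, 10.85),
    32: (9.65, 5.43), 33: (4.62, 7.22), 42: (14.28, 14.85), 46: (10.08, 9.83),
    47: (6.22, 11.55), 56: (10.56, 9.69),
}


def reversal_pairs(mean_rt):
    """Pairs (g, h) where Others are slower on g than on h, but Group 2 is faster."""
    pairs = []
    for g, (a_g, b_g) in mean_rt.items():
        for h, (a_h, b_h) in mean_rt.items():
            if a_g > a_h and b_g < b_h:
                pairs.append((g, h))
    return pairs


def simple_rule(t1, t2):
    """Choose A if the first item took longer than the second, B otherwise."""
    return "A" if t1 > t2 else "B"


pairs = reversal_pairs(MEAN_RT)
print((32, 33) in pairs, len(pairs))

# Cut point used later: the mean of the four means for Items 32 and 33.
cut = round((9.65 + 5.43 + 4.62 + 7.22) / 4, 2)
print(cut)
```

Items 32 and 33 form one such pair, which is why they serve as Items 1 and 2 in the example; other pairs in the table also show a reversal, and the sketch makes no claim about which pair discriminates best.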


Figure 6

Sequential Decision for Postdicting Method A or B Based on
Knowledge of τ and Response Times for Items 1 and 2.

Give Item 1.
  If τ < 6.73:
    t_1 ≥ 9.65: Decide A.
    t_1 < 9.65: Give Item 2.
      t_2 > 7.22: Decide B.
      t_2 ≤ 7.22: Decide A if t_1 > t_2; Decide B if t_1 < t_2.
  If τ ≥ 6.73:
    t_1 < 5.43: Decide B.
    t_1 ≥ 5.43: Give Item 2.
      t_2 < 4.62: Decide A.
      t_2 ≥ 4.62: Decide A if t_1 > t_2; Decide B if t_1 < t_2.

First, only Item 1 is administered to this person. Now suppose his/her τ_i
is less than 6.73 sec. (the mean of the four mean response times listed above).
Then, if t_1 ≥ 9.65, A is chosen and testing is terminated. If, on the other
hand, t_1 < 9.65, Item 2 is then administered, and B is chosen if t_2 > 7.22;
otherwise, A or B is chosen accordingly as t_1 > t_2 or t_1 < t_2,
respectively. When the person's τ_i is greater than or equal to 6.73, the
sequential decision will be the dual of the above. Namely, if t_1 < 5.43, B is
chosen; if t_1 ≥ 5.43, Item 2 is further administered, and A is chosen if
t_2 < 4.62. If t_2 ≥ 4.62, A or B is chosen according to the magnitude order of
t_1 and t_2.
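The sequential rule translates directly into code. The sketch below is one possible rendering, not the authors' software: the function name and the callable `get_t2` (invoked only when Item 2 actually has to be administered) are hypothetical, the thresholds are the four group means quoted in the text, and a tie between t_1 and t_2 is resolved as B, which the text leaves unspecified.

```python
TAU_CUT = 6.73                   # mean of the four group means
T1_MEAN_A, T1_MEAN_B = 9.65, 5.43  # Item 1 means under methods A and B
T2_MEAN_A, T2_MEAN_B = 4.62, 7.22  # Item 2 means under methods A and B


def sequential_postdict(tau, t1, get_t2):
    """Return (decision, number_of_items_used).

    tau     estimated latent response time of the person
    t1      observed response time on Item 1
    get_t2  zero-argument callable returning the Item 2 time when needed
    """
    if tau < TAU_CUT:
        if t1 >= T1_MEAN_A:
            return "A", 1
        t2 = get_t2()
        if t2 > T2_MEAN_B:
            return "B", 2
    else:
        if t1 < T1_MEAN_B:
            return "B", 1
        t2 = get_t2()
        if t2 < T2_MEAN_A:
            return "A", 2
    # Fall back on the magnitude order of the two observed times
    # (a tie is treated as B; this is an assumption).
    return ("A" if t1 > t2 else "B"), 2


print(sequential_postdict(5.0, 10.0, lambda: 3.0))
print(sequential_postdict(8.0, 6.0, lambda: 7.0))
```

Passing the Item 2 time as a callable keeps the early-termination feature visible: when the first branch decides, the second item is never requested.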


Refinements to this simple approach would include getting CRR distributions
for each item, given instructional background and the value of τ. Further, with
a suitable assumption concerning the distribution of τ, the posterior
probability for each instructional background could be derived given τ and the
magnitude order of t_1 and t_2. With more than two instructional backgrounds
and a larger number of discriminating items, the magnitude order of two
response times would be generalized to a vector of response times exhibiting
different patterns, i.e., permutations of the magnitudes of the elements.

Person  Conditional  Response  Rate  (PCRR) 


Some discussion is in order to explain why the Weibull family was chosen
over the gamma, despite the latter's having a longer tradition of usage in
response time models (e.g., Rasch, 1960; Restle & Davis, 1962). First, the gamma
distributions are indicated when distinct stages are identifiable in the process
of solving the tasks, in which case c must be an integer representing the number
of stages. Second, the shape parameter c of the Weibull family has the
interesting feature of apparently distinguishing between different
information-processing skills associated with different instructional
backgrounds. This feature is no doubt related to the fact that the magnitude of
c (i.e., whether c is greater than, equal to, or less than 1) determines the
nature of the item conditional response rate function (ICRR), which describes
whether perseverance increases the chances of an examinee's responding to an
item, whether responses occur at random times, or whether a point of
diminishing returns is reached early. In other words, it can be said, as
mentioned in the first section on the rationale of Weibull distributions, that
c is sensitive to the degree of involvement students show. Two different
instructional methods usually require different steps of information-processing
skills; thus, each method requires a different degree of involvement in solving
a given item. For example, some items in Table 1, such as "-11 + 10 = ?" in the
signed number posttest, yield not only different values of c but also
significantly different mean response times, depending on whether the
sequential or number-lines method is used for answering, as dictated by the
examinee's instructional background. Moreover, the convenient ICRR function is
readily expressed in closed form for a Weibull distribution but cannot be so
expressed for a gamma distribution, because the incomplete gamma function
cannot be expressed analytically.


The ICRR function is the probability that an examinee who has not responded
to an item by time t will do so within an infinitesimal time interval
thereafter. When item response times follow a Weibull distribution, this
function H_g(t) is given by f_g(t)/[1 - F_g(t)], where f_g(t) and F_g(t) are
the expressions in Equations 1 and 2 with the parameters subscripted with a g
for item g and, in this case, t_0 set equal to 0. Hence,

$$H_g(t) = \frac{f_g(t)}{1 - F_g(t)} = \frac{c_g}{u_g^{\,c_g}}\,t^{\,c_g - 1} \qquad [6]$$


From this, the transition to PCRR is made in a manner analogous to going
from an item characteristic curve (ICC) to a person characteristic curve (PCC),
first suggested by Mosier (1940, 1942), recently by Weiss (1973; Vale & Weiss,
1975), and discussed in greater detail by Lumsden (1978) and Trabin and Weiss
(1979). In their case, for each individual a plot is made of the proportions of
items of varying difficulty (represented by the horizontal axis) that are passed
by that individual. In the present case, however, the ordinate at each point
along the horizontal axis representing the mean response time of an item would
be the value of H_g(τ), where τ is the latent response time for the particular
person, computed from Equation 6 with the parameter values proper to that item
substituted for u_g and c_g. Note that when shape parameter c equals 1.0 for all
items, the PCRR curves are identical for all persons. Thus, utilizing the
negative exponential distribution (i.e., a special case of the Weibull
distribution functions) for this purpose will not work.
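The remark about c = 1.0 is easy to verify numerically. The following sketch, under stated assumptions, builds a person's curve as the set of points (item mean response time, H_g(τ)) using the closed-form Weibull conditional response rate with location 0, (c/u)(t/u)^(c-1). The function names and all item parameter values are hypothetical and chosen only for illustration.

```python
def icrr(t, c, u):
    """Weibull conditional response rate with location 0: (c/u) * (t/u)**(c-1)."""
    return (c / u) * (t / u) ** (c - 1.0)


def pcrr_curve(tau, items):
    """Points (item mean response time, H_g(tau)), ordered along the horizontal axis.

    items is a list of (mean_rt, c_g, u_g) tuples.
    """
    return sorted((mean_rt, icrr(tau, c, u)) for mean_rt, c, u in items)


# Hypothetical item parameters: (mean response time, shape, scale).
mixed_items = [(5.0, 0.8, 6.0), (9.0, 1.4, 10.0), (14.0, 1.9, 16.0)]
exponential_items = [(5.0, 1.0, 6.0), (9.0, 1.0, 10.0), (14.0, 1.0, 16.0)]

quick, slow = 5.0, 12.0  # two persons with different latent response times
print(pcrr_curve(quick, mixed_items) != pcrr_curve(slow, mixed_items))
print(pcrr_curve(quick, exponential_items) == pcrr_curve(slow, exponential_items))
```

With shape 1 for every item the rate reduces to 1/u_g regardless of τ, so the two persons' curves coincide; with mixed shapes the curves separate, which is what allows them to carry person-level information.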

Equation 6 defines a function whose curve characterizes the behavior of
item g over time in terms of the probability of reaching an answer. The steeper
the slope of a curve is, the greater the chance that item g will be solved as
time elapses. The steepness of the curves is a characteristic attributed to a
given item, similar to the item discrimination index in latent trait theory. A
pseudo-CRR function on variable τ can be defined as follows:

$$H_{d_g}(\tau) = \frac{F'_{d_g}(\tau)}{1 - F_{d_g}(\tau)} \qquad [7]$$

Similarly, a pseudo-PCRR function is given as a function on a set of Equations
7, H_{d_g}(τ), g = 1, ..., n. For a fixed person i,

$$G_i(\bar{t}_g) = H_{d_g}(\tau_i), \qquad g = 1, \ldots, n \qquad [8]$$

where \bar{t}_g is the mean response time of item g.


It should be noted that the τ in Equations 7 and 8 is merely an arbitrary time
point and bears no relation to a person's latent response time (except for
coinciding in numerical value). Only in the context of the random variable
τ + d_g does τ have the sense of latent response time; but to use τ + d_g as the
argument of H_{d_g}(τ) would be meaningless, because τ + d_g is approximately
the person's observed response time for item g, and it would be a contradiction
in terms to speak of the person's responding in the next moment, given that
he/she has not responded up to the actual time point at which he/she did
respond.

The above remarks indicate that the particular approach attempted here for
defining PCRR was futile, but not that the concept of PCRR itself is meaningless.
An alternative, more justifiable approach might be to transform response
time to an approximately normal variable. The transformation derived by the usual
method of obtaining variance-stabilizing transformations was unusable because it
was an arcsine transformation whose argument could exceed one.1 Therefore, the
usual method was extended by taking up to the second term, instead of only the
first, in the Taylor series expansion on which the transformation is based. The
result was that the transform y is the solution of the following rather formidable
differential equation:

[The differential equation is illegible in the source scan.]

where

[The definition given here is illegible in the source scan.]

and c is an arbitrary positive constant.2 If it is further assumed that τ is
normally distributed (which seems reasonable by virtue of the central limit
theorem, since τ_i is a person's mean response time over an infinite set of items,
which may be regarded as exhibiting local independence if unidimensionality holds),
then y and τ would jointly follow a bivariate normal distribution. Hence, if
their correlation ρ can be estimated (roughly analogous to communality estimation
in factor analysis), the joint distribution would be uniquely determined.
From this and the distribution of τ, the conditional distribution of y given τ
can be determined. All quantities associated with persons having a particular τ
value are computed from this conditional distribution.

1 This fact was noticed and pointed out by Jim Paulson at the 1979 Computerized
Adaptive Testing Conference.
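
The conditional distribution invoked here is the standard bivariate normal result: if y and τ have means μ_y and μ_τ, standard deviations σ_y and σ_τ, and correlation ρ, then y given τ is normal with mean μ_y + ρ(σ_y/σ_τ)(τ - μ_τ) and standard deviation σ_y(1 - ρ²)^(1/2). The following minimal Python sketch merely evaluates that formula; the function name and all numerical values are illustrative assumptions and are not taken from the study.

```python
import math


def conditional_normal(tau, mu_y, mu_tau, sd_y, sd_tau, rho):
    """Mean and SD of y given tau when (y, tau) is bivariate normal.

    All argument names are hypothetical; rho is the correlation of y and tau.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    cond_mean = mu_y + rho * (sd_y / sd_tau) * (tau - mu_tau)
    cond_sd = sd_y * math.sqrt(1.0 - rho ** 2)
    return cond_mean, cond_sd


# Made-up values, for illustration only.
cond_mean, cond_sd = conditional_normal(
    tau=12.0, mu_y=0.0, mu_tau=10.0, sd_y=1.0, sd_tau=4.0, rho=0.6
)
```

The larger the estimated correlation, the narrower the conditional distribution, so the usefulness of this approach would depend on how well ρ can be estimated.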


Estimation  of  the  Parameters 

Interest is now in estimating c_g, u_g, and τ_i, g = 1, ..., n, i = 1, ..., N,
simultaneously. The set of admissible values of these parameters must be chosen that
makes the log-likelihood function, ln L, a maximum. Unlike the case of dealing
with performance scores, response time represents two different cases: one is the
group whose members obtained the correct answer (the OK subgroup), and the other
is the group whose members obtained wrong answers (the NO subgroup). Response time
in the OK subgroup means the time needed to attain a given goal using a successful
information-processing skill (or skills); but it is not that simple with the NO
subgroup. A brief investigation of error analysis for the NO subgroup indicates
that various kinds of misconceptions might have occurred at different stages of
progress toward a correct answer for a given item. Therefore, only the OK subgroup
will be considered in this paper.


Differentiating the logarithm of Equation 5 with respect to parameters c_g, u_g,
and τ_i, respectively, and setting the results equal to zero gives the following
simultaneous equations:

[Equations 10, 11, and 12 are illegible in the source scan.]


2 This has not yet been solved, but a mathematician colleague assures us that it
is soluble.


The maximum likelihood method using the Newton-Raphson iteration procedure
provides estimates of the roots of Equations 10 and 11, where τ_i, i = 1, ..., N,
are replaced by the mean response time of each person over items g = 1, ..., n.
Then the roots of Equation 12, after the newly estimated c_g and u_g values are
substituted, are sought by the same procedure.


A sufficient, but not necessary, condition that any of these stationary
values, c_g, u_g, τ_i, be local maxima is that

$$ \frac{\partial^2 \ln L}{\partial c_g^2} < 0, \tag{13} $$

$$ \frac{\partial^2 \ln L}{\partial u_g^2} < 0, \tag{14} $$

$$ \frac{\partial^2 \ln L}{\partial \tau_i^2} < 0. \tag{15} $$

It should be noted that the left-hand sides of Equations 13 and 15 are always
negative, but that of Equation 14 will be negative only when the estimates of u_g
are close enough to the roots of Equation 11. In some earlier stages of the
iterations, the condition for u_g to yield a local maximum might not be satisfied.
Thus, it is important to select appropriate starting values for the estimates of u_g.

For estimation of τ_i, solutions of Equation 12 are sought for which Equation
15 is satisfied. If c_g = 1 for all g, g = 1, ..., n, then Equation 12 becomes

[This equation is illegible in the source scan.]

Since scale parameter u_g equals the mean of observed response time when shape
parameter c_g is equal to unity, this equation becomes equivalent to

[This equation is illegible in the source scan.]

But the reciprocal of the observed mean cannot be zero for any item g; therefore,
there should be some g such that c_g ≠ 1. This implies that the maximum likelihood
method does not work for response time models associated with the negative
exponential distributions as long as the models are formulated assuming
unidimensionality. Moreover, the notion of PCRR, which is parallel to that of PCC,
will not be applicable to these models. This is because CRR functions are always
parallel to the horizontal axis when the occurrence of a response is a random
event and all the random events are assumed to be of the same kind, which
is the case for negative exponential distributions.


Numerical  Example 

The parameters in the model were successfully estimated3 for sample data,
the pretest data of the signed-number arithmetic lessons. Unfortunately, the
posttest data of signed-number operations described above (see also Appendix
Tables A-1, A-2, or A-3) could not provide stable estimates with this computer
program. The sample size for Group 2 was too small, and the number of items (the
14 items that were of most interest) was not large enough. When the observed
response time data were fitted earlier to the Weibull distributions, it
was observed that the items testing the same skill in the pretest showed a
systematic change in the estimates of u and t_0 according to their order of
presentation, even though the difficulties of these parallel items did not show any
noticeable change.

With this new model the changes in the slopes of the parallel items have a
strong tendency toward being monotonically increasing. For example, Items 3 and
45 ask "-3 + 2 = ?" and "-7 + 5 = ?," respectively. The dotted lines in Figure 7 are
CRR functions associated with the observed response time, and the solid lines
are the theoretically derived CRR functions. It is interesting to note that the
random variable τ_i + d_ig can be rewritten as (τ_i - t̄_i.) + t_ig = T_ig. Denoting
τ_i - t̄_i. = -e_i, T_ig can be expressed as t_ig - e_i. Hence, the observed
response time t_ig becomes the sum of a true-score-like T_ig and an "error" e_i.
Therefore it might be considered that the theoretical CRR functions are defined
on a pseudo true score random variable, T_ig.

3 A computer program for estimating parameters u_g, c_g, and τ_i for g = 1, ..., n,
i = 1, ..., N was written on the PLATO system by Robert Baillie.


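Figure 7

CRR Functions Defined on the Random Variable T_ig
CRR = F'(t) / [1 - F(t)], F(t) = Weibull Distribution Function

Curve          Item    t_0     c        u       Mean
Observed         3      5     1.09     8.80    13.517
Observed        45      4     1.22     6.28     9.883
Theoretical      3      0     1.95    12.35    10.951
Theoretical     45      0     1.68     8.36     7.465

[Plot not reproduced: dotted lines are the observed CRR functions, solid lines the theoretical CRR functions.]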

Summary  and  Discussion 


The customary method for assigning a score to an individual on adaptive
tests (or, for that matter, whenever a latent trait model is employed) is to use
the estimate θ̂ of the ability (or achievement) parameter. This may be adequate
when the only purpose of testing is to calibrate the individual's ability or
achievement level. When the further purpose of using θ̂ as the basis for routing
the student to a suitable starting point in a lesson series is involved, however,
sole reliance on θ̂ can create serious problems. This is because two examinees
may have identical response patterns (and, hence, θ̂ values) and yet differ
drastically in the manner in which they arrived at their answers to the
items, correct or incorrect as the case may be; that is, in the cognitive
processes and information-processing skills that are brought into play. Efficient
and effective routing of students to lessons requires this deeper diagnosis
instead of mere information as to which items they get right or wrong.

Increasing recognition is being given to this fact, as evidenced by the
number of studies either directly or indirectly germane to it that have recently
been done by cognitive psychologists (e.g., Anderson, Kline, & Beasley, 1978;
Carroll, 1978; Frederiksen, 1978; Greeno, 1977; Groen & Parkman, 1972; Heller &
Greeno, 1978; Rose, 1978; Sternberg, 1979). These studies have demonstrated the
existence of a variety of cognitive processes, which differ from individual to
individual.

One clue to the type of cognitive process employed by a student in solving
a given problem can come from knowing his/her instructional background.
Fortunately, a follow-up study of Tatsuoka and Birenbaum (1979) indicates that the
Weibull shape parameter c, obtained by fitting response time data, is helpful in
differentiating among various instructional methods associated with signed-number
operations. The Weibull distributions can be mathematically derived from the
assumption that the CRR (essentially, the conditional probability that a person
will respond to a given item during the interval [t, t + dt], given that he/she
has not responded to the item up to time t) is monotonically increasing,
decreasing, or constant. The slope of the CRR function for a given item is
determined by the magnitude of the shape parameter and the mean of the item
response time. If c is larger than 1, then CRR is a monotonically increasing
function. If c = 1, then CRR is constant.
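
The dependence of the CRR on the shape parameter can be checked numerically. The sketch below assumes the common Weibull parameterization F(t) = 1 - exp{-[(t - t_0)/u]^c}, for which the CRR is (c/u)[(t - t_0)/u]^(c - 1) and the mean is t_0 + uΓ(1 + 1/c); that parameterization and the function names are assumptions, since the paper's own density is defined outside this section. The shape and scale values are those listed in Figure 7 for Item 3 with t_0 = 0.

```python
import math


def weibull_crr(t, u, c, t0=0.0):
    """Conditional response rate F'(t) / [1 - F(t)] of a Weibull distribution.

    u is the scale, c the shape, and t0 the location (assumed parameterization).
    """
    if t <= t0:
        raise ValueError("t must exceed the location parameter t0")
    return (c / u) * ((t - t0) / u) ** (c - 1.0)


def weibull_mean(u, c, t0=0.0):
    """Mean response time t0 + u * Gamma(1 + 1/c)."""
    return t0 + u * math.gamma(1.0 + 1.0 / c)


times = [2.0, 5.0, 10.0, 20.0]

# c > 1: the CRR rises with elapsed time (Item 3 values from Figure 7).
increasing = [weibull_crr(t, u=12.35, c=1.95) for t in times]

# c = 1 (negative exponential): the CRR is the constant 1/u.
constant = [weibull_crr(t, u=12.35, c=1.0) for t in times]
```

Under this parameterization the mean for c = 1 equals the scale u, which agrees with the statement in the text that the scale parameter equals the mean response time when the shape parameter is unity.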

Some types of information-processing skill require a greater amount of
involvement in a student's effort to solve a given problem, whereas others do
not require so much to obtain the answer to the same item. The magnitude of the
shape parameter c and the mean response time for the former become noticeably larger
than those for the latter. Therefore, the slopes of the CRR functions differ in
steepness to a greater extent. This sensitivity of the Weibull distributions to
the procedures associated with different teaching methods is an advantage in
psychological research. As Scheiblechner (1979) has stated, "the
exponential or Weibull distribution is an adequate model for more sorts of
psychological data than is commonly assumed if the parametric structure of the
latencies is properly chosen."

First, it was assumed that for a given set of items there exists a latent
variable affecting the time taken by an examinee to answer each of these items.
A model associated with response time, roughly paralleling latent trait theory,
was formulated on the strength of the observed fact that the Weibull distribution
fits the response time data for most items. The main concern in the model
is to express the relationship between the latent response time variable and the
information-processing skills.

An estimation routine for the parameters by the maximum likelihood method
was programmed, and a numerical example was shown. The maximum likelihood method
cannot be used to estimate Weibull parameters when all shape parameters are
assumed to equal 1, that is, in the case of negative exponential distributions.
Further research will be necessary to explore a different parameter estimation
procedure, such as the conditional maximum likelihood method.

The information function of item g, I_g(θ), was integrated numerically and found
to be always constant except for c_g = 1. However, its discussion will be
reported in another paper.

The particular approach attempted here for defining item CRR and PCRR
functions resulted in the loss of an attractive feature: the capability to give
mathematical meaning to the curves in terms of τ_i. However, this attractive
feature still holds for the variable mentioned in the numerical example, that
is, T_ig. An alternative approach was outlined, but further research is
necessary to make this approach operational.


Latent Trait Scoring of Timed Ability Tests


David  Thissen 
University  of  Kansas 


The  advent  of  computerized  testing  has  made  timed  testing  a  feasible  pro¬ 
cess.  Paper-and-pencil  testing  technology  limited  the  test  theorist  to  an  anal¬ 
ysis  of  the  responses  of  the  examinees.  Computerized  testing,  on  the  other 
hand,  has  the  potential  to  provide  the  tester  with  a  great  deal  more  information 
than  that  contained  in  the  responses  alone.  As  adaptive  tests  become  more  effi¬ 
cient,  and  as  they  become  shorter,  each  item  and  its  associated  response  must 
provide  more  and  more  information  about  the  ability  of  the  examinee.  But  only  a 
limited  amount  of  information  can  be  obtained  from  binary  responses,  and  even 
the  use  of  three  or  more  response  categories  provides  only  limited  increases  in 
the  amount  of  information  provided  by  any  one  item  (Bock,  1972;  Samejima,  1969; 
Thissen, 1976b). The additional information needed must come from some other
response  variable,  preferably  a  continuous  one;  response  latency  is  a  likely 
candidate. 

Although  there  exist  completely  general  latent  trait  test  item  response 
models  for  ordered  or  for  strictly  nominal  item  responses  in  any  number  of  re¬ 
sponse  categories  (Bock,  1972;  Samejima,  1969),  as  well  as  at  least  one  rela¬ 
tively  specialized  latent  trait  model  for  continuous  item  responses  (Samejima, 
1973),  there  has  been  little  work  on  item  response  models  for  timed  test  data. 
White  (1973)  proposed  a  model  for  individual  differences  in  speed  and  accuracy 
in  a  timed  testing  situation;  but  in  that  system  the  response  latencies  were 
used  as  a  fixed  part  of  the  model  rather  than  as  observations  measured  with  er¬ 
ror.  It  seems  more  in  keeping  with  the  spirit  of  latent  trait  test  theory  to 
consider  the  latencies,  like  the  item  responses  themselves,  to  be  fallible  data 
reflecting  underlying  trait  values. 

In research predating contemporary statistical estimation schemes for the
application of latent trait item response models, Furneaux (1961) suggested that
the normal ogive model would serve quite well for item responses obtained in a
timed testing situation. He found that for timed responses to letter series
problems, the log-transformed latencies were linearly related to heuristically
estimated normal ogive ability and difficulty parameters. Following lines
suggested by Furneaux's findings, Thissen (1976a) suggested an integrated model for
the item responses and latencies obtained in timed testing and developed a
scheme for the joint maximum likelihood estimation of the parameters of that
model.


In  that  earlier  research,  the  results  of  the  application  of  the  timed  test¬ 
ing  model  to  data  from  a  test  of  spatial  visualizing  ability,  as  well  as  to  data 
from the Matching Familiar Figures test (Kagan, Rosman, Day, Albert, & Phillips,
1964)  and  to  a  laboratory  perception  task,  were  discussed.  The  present  paper 
concerns  the  application  of  the  same  timed  testing  model  to  data  obtained  with 
three  tests  of  classical  form:  a  subset  of  the  Raven  (1956)  Progressive  Matri¬ 
ces,  a  version  of  the  Guilford-Zimmerman  (1953)  Aptitude  Survey  "Clocks"  Spatial 
Visualization  Test,  and  a  Verbal  Analogies  test  drawn  from  the  Minnesota  Multi¬ 
modal  Analogy  Test,  consisting  of  items  described  by  Tinsley  (1971)  and  analyzed 
extensively  by  Whitely  (1977). 

The  Model 

The  item  response  model  used  here  was  actually  developed  primarily  on  the 
basis  of  exploratory  data  analysis  of  timed  test  data;  that  development  is  de¬ 
scribed  elsewhere  (Thissen,  1976a).  In  this  section  a  development  of  this  mod¬ 
el,  which  is  not  entirely  based  on  the  form  of  timed  test  response  data,  will  be 
discussed. 


The technology for latent trait item analysis of item response data is well
established when the items are scored dichotomously. The logistic model
described by Birnbaum (1968) has been useful in many situations; therefore, it
would seem appropriate to use that model, at least as a first approximation, for
the item response data obtained in the timed testing situation. If the item
response r_ij of person i to item j is scored r_ij = 1 if the response is
correct and 0 otherwise, then the logistic model is

$$ P(r_{ij} = 1) = \frac{1}{1 + \exp(-z_{ij})} \tag{1} $$

where

$$ z_{ij} = a_j \theta_i + c_j $$

and

$$ P(r_{ij} = 0) = 1 - P(r_{ij} = 1). \tag{2} $$



In the context of the timed testing situation, θ_i could be called the "effective
ability" of person i (following White, 1973) to distinguish this ability
estimate, which is obtained under circumstances allowing the examinee any amount of
time to respond, from more conventional ability estimates obtained with speeded
tests. The parameter a_j is the discrimination parameter, or "slope," of item j;
it reflects the (possibly) different degrees of relationship between the items
and the trait being measured. The parameter c_j is the easiness of item j.
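
As an illustration of Equation 1, the short Python sketch below evaluates the probability of a correct response from the logit z_ij = a_j θ_i + c_j. The function name and the numerical values are illustrative assumptions only.

```python
import math


def p_correct(theta, a, c):
    """Equation 1: P(r_ij = 1) = 1 / [1 + exp(-z_ij)], z_ij = a_j * theta_i + c_j.

    theta is the effective ability of the person, a the item slope,
    and c the item easiness.
    """
    z = a * theta + c
    return 1.0 / (1.0 + math.exp(-z))


# Made-up item: slope 1.2, easiness -0.5, at three ability levels.
probabilities = [p_correct(theta, a=1.2, c=-0.5) for theta in (-1.0, 0.0, 1.0)]
```

Because c_j enters with a positive sign, larger values mean easier items, in contrast to the difficulty parameterization used in some other latent trait models.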

To follow the traditional forms of latent trait test theory, the response
time of person i on item j, t_ij, should also be a function of some parameters
describing characteristics of person i and item j. It would be interesting and
useful if those parameters were the same as the parameters that are used to
describe the item responses; that would mean that the variations in item responses
and response times were attributable to the same sources and that the responses
and latencies could both be used in the measurement of the ability of each
examinee and the easiness of each item. Latency should clearly be a decreasing
function of z_ij; that is, increases in z_ij (which imply increased person
ability or item easiness) should be related to decreases in response latency.

The form of the random error involved in the measurement of response latency
must also be specified; data analytic considerations suggest that response times
are frequently approximately log-normally distributed, so a linear model for the
logarithm of latency could be assumed to include Gaussian error. This suggests
the following model:

$$ \log(t_{ij}) = \nu - b z_{ij} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^2), \tag{3} $$

in which ν is the overall mean log response time and b is a regression parameter
reflecting the relationship of effective ability and easiness (in the logit z_ij)
with latency, and the scale conversion of logits to log seconds. Of course, the
parameters of z_ij in Equation 3 are the same as the parameters of z_ij in
Equation 1; they may be simultaneously estimated using current constrained maximum
likelihood estimation techniques.

However,  as  it  stands,  the  model  described  by  Equation  3  is  unlikely  to 
provide  a  very  good  fit  to  observed  data.  There  are  likely  to  be  attributes  of 
both  the  examinees  and  the  items  that  contribute  consistently  to  latency  but 
that  are  unrelated  to  the  trait  the  items  are  intended  to  measure.  For  verbal 
items  either  the  length  of  the  item  (and  the  associated  reading  time)  or  the 
number  of  attempts  required  to  obtain  the  needed  semantic  data  may  contribute  to 
response  latency;  but  these  factors  could  be  unrelated  to  the  easiness  of  that 
item  (and  the  probability  of  a  correct  response).  Some  examinees  may  press  the 
response keys more slowly than others in a pattern that is consistent but
unrelated to their ability. The addition of person and item slowness parameters,
s_i and u_j, to Equation 3 results in the model

$$ \log(t_{ij}) = \nu + s_i + u_j - b z_{ij} + \varepsilon_{ij} \tag{4} $$

where ε_ij ~ N(0, σ²).

The  hybrid  model  formed  by  the  combination  of  the  model  in  Equation  1  for 
the  item  response  data  and  Equation  4  for  the  response  times  will  be  used  in  the 
analysis  of  timed  test  data  in  the  sequel.  This  hybrid  model  includes  a  number 
of  assumptions  about  functional  forms:  The  logistic  is  used  as  the  probability 
of  a  correct  item  response  as  a  function  of  effective  ability,  and  additive  lin¬ 
ear  effects  and  Gaussian  error  are  assumed  for  the  log  of  response  time.  As¬ 
sumptions  such  as  these  must  be  made  to  provide  the  basis  for  the  maximum  like¬ 
lihood  parameter  estimation. 
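
To make the hybrid model concrete, the following Python sketch generates one response and one latency for a single person-item pair: the response from the logistic model of Equation 1 and the latency from the log-linear model of Equation 4. The function names, the random-number seed, and every parameter value are illustrative assumptions, not estimates from the data.

```python
import math
import random


def logit(theta, a, c):
    """z_ij = a_j * theta_i + c_j, the logit shared by Equations 1 and 4."""
    return a * theta + c


def expected_log_latency(theta, s, a, c, u, nu, b):
    """Mean of log(t_ij) under Equation 4: nu + s_i + u_j - b * z_ij."""
    return nu + s + u - b * logit(theta, a, c)


def simulate_observation(theta, s, a, c, u, nu, b, sigma, rng):
    """Draw a (response, latency) pair for one person and one item."""
    z = logit(theta, a, c)
    p = 1.0 / (1.0 + math.exp(-z))
    response = 1 if rng.random() < p else 0
    # Gaussian error on the log scale gives a log-normal latency.
    log_t = expected_log_latency(theta, s, a, c, u, nu, b) + rng.gauss(0.0, sigma)
    return response, math.exp(log_t)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
response, latency = simulate_observation(
    theta=0.5, s=0.1, a=1.2, c=-0.3, u=0.2, nu=2.0, b=0.4, sigma=0.3, rng=rng
)
```

With b positive, a higher effective ability or an easier item lowers the expected log latency, while s_i and u_j shift it without affecting the probability of a correct response.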

Two comments are in order about the functions included in the model.
First, after the parameters of the model have been estimated, the appropriateness
of the assumed functional forms may be evaluated to some extent by an
examination of the goodness of fit of the various features of the model. The
empirical proportions of correct responses at various levels of ability may be
compared to the logistic predictions; the normality of the actual residuals of
log(t_ij) from the linear model may be evaluated. To the extent that the
functional forms included in the model follow the observed data, the parameters of the
model should provide a useful summary of the item response data.

Secondly,  this  model  is  not  being  proposed  as  a  process  model  for  the  psy¬ 
chological  description  of  item  response  behavior.  Although  such  process  models 
have  been  developed  for  choice  behavior  (see  Audley,  1960,  1973;  Laming,  1968; 
Luce,  1960)  and  are  currently  being  developed  for  specific  types  of  complex  cog¬ 
nitive  abilities  (see  Sternberg,  1977;  Whitely,  1979),  they  tend  to  be  too  com¬ 
plex  for  current  application  in  practical  testing  and  measurement.  The  model 
proposed  here  follows  the  tradition  of  earlier  test  theory  in  that  it  attempts 
to  generally  describe  responses  to  a  variety  of  possible  kinds  of  test  items. 

It  is  to  be  hoped  that  the  parameters  of  the  hybrid  timed  testing  model  may  be 
useful  in  some  psychological  process  research,  such  as  summaries  of  attributes 
of  test  items  and  examinees,  which  may  be  compared  to  theoretical  predictions. 
Beyond  that,  the  current  model  is  meant  to  be  a  general  model  for  the  measure¬ 
ment  of  examinees  and  the  calibration  of  test  items;  such  a  model  will  likely  be 
superseded  by  accurate,  complete  psychological  models  for  cognitive  processes  as 
soon as they become available for cognitive tasks in the domain of ability
measurement.

Estimation  of  Model  Parameters 


The parameters of the timed testing model to be estimated form the set ξ =
{θ, s, a, c, u, ν, b, σ²}; there are 2N + 3n independent parameters (for N
persons and n items), since the location and scale of θ, as well as the location of
the slowness parameters, must be set arbitrarily. With the usual assumption of
local independence extended to include the assumption that the error of
observation of latency is independent of the response to the same item, conditional on
the parameters, the likelihood of the entire set of data, given the parameters,
may be written

$$ L(R, T \mid \xi) = \prod_i \prod_j P_{ij} H_{ij} \tag{5} $$

in which

$$ P_{ij} = P(r_{ij} = 1)^{r_{ij}} \, P(r_{ij} = 0)^{1 - r_{ij}} $$

and

$$ H_{ij} = \phi(d_{ij}) $$

where φ(d) is the standard normal density and

$$ d_{ij} = \{\log(t_{ij}) - [\nu + s_i + u_j - b z_{ij}]\} / \sigma . $$
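
A minimal Python sketch of the logarithm of this likelihood follows. It is an illustration under stated assumptions rather than the program used in the study: the function name and data layout (lists of lists indexed by person and item) are hypothetical, missing observations are represented by None and skipped, and the density of log(t_ij) is written as the usual normal density φ(d_ij)/σ, so that a -log σ term appears.

```python
import math


def log_likelihood(responses, latencies, theta, s, a, c, u, nu, b, sigma):
    """Joint log likelihood of responses and latencies for the hybrid model.

    responses[i][j] is 1, 0, or None; latencies[i][j] is a positive time or None.
    theta and s are per-person lists; a, c, and u are per-item lists.
    """
    total = 0.0
    for i in range(len(theta)):
        for j in range(len(a)):
            z = a[j] * theta[i] + c[j]
            r = responses[i][j]
            if r is not None:
                # log P(r = 1) = -log(1 + exp(-z)); log P(r = 0) = -log(1 + exp(z))
                total += -math.log1p(math.exp(-z)) if r == 1 else -math.log1p(math.exp(z))
            t = latencies[i][j]
            if t is not None:
                d = (math.log(t) - (nu + s[i] + u[j] - b * z)) / sigma
                # log of the normal density of log(t): log phi(d) - log(sigma)
                total += -0.5 * d * d - 0.5 * math.log(2.0 * math.pi) - math.log(sigma)
    return total


# One person, one item, made-up values: z = 0 and the latency sits on its mean.
example = log_likelihood(
    responses=[[1]], latencies=[[math.exp(2.0)]],
    theta=[0.0], s=[0.0], a=[1.0], c=[0.0], u=[0.0],
    nu=2.0, b=0.5, sigma=1.0,
)
```

Working on the log scale turns the double product into a double sum, which is the quantity a gradient-based or Newton-type routine would actually maximize.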

A two-stage procedure for locating the maximum of L(R, T | ξ) is described in
detail  by  Thissen  (1976a).  The  first  stage  of  the  procedure  makes  use  of  the 
conjugate  gradient  method  (Fletcher  &  Reeves,  1964)  for  function  minimization 
(maximization,  in  this  case)  to  approach  the  joint  maximum  of  the  log  likelihood 
in  the  2N  +  3n  dimensional  parameter  space.  This  procedure  requires  1.5  to  2 
times  as  many  iterations  as  there  are  parameters  to  be  estimated;  but  it  pro¬ 
ceeds  quickly,  as  the  conjugate  gradient  technique  requires  computation  only  of 
the  first  derivatives  of  the  log  likelihood.  As  the  maximum  is  approached,  and 
the  corrections  become  smaller,  the  conjugate  gradient  algorithm  generally  en¬ 
counters  difficulty  in  locating  an  appropriate  direction  in  the  parameter  space, 
which  has  dimensionality  of  several  hundred.  At  that  point  the  second  stage  of 
the  estimation  procedure  is  entered.  The  second  stage  is  a  cyclic  procedure  in 
which  each  person's  parameters  are  estimated  individually  using  the  current  val¬ 
ues  of  the  item  and  the  overall  test  parameters;  then  each  item's  parameters  are 
similarly  estimated,  the  overall  test  parameters  are  revised,  and  the  procedure 
is repeated. In the second stage, the maximizations in few dimensions for each
subject and item are performed using a conditioned Newton-Raphson algorithm
programmed by Haberman (1970). By the time the Newton-Raphson corrections are of
the order 10^-3 for each parameter, there is generally no appreciable change
between cycles, and the procedure is stopped.

A useful feature of the second-stage estimation procedure is that, since the
Newton-Raphson iterations require matrices of second partial derivatives for the
person parameters (within each person) and the item parameters (within each
item), those matrices may be treated as information matrices and may be inverted
to give estimated standard errors for the parameters. The resulting estimates of
the standard errors are, unfortunately, somewhat smaller than they ought to be,
as their computation ignores the fact that all of the other person (or item)
parameters have also been estimated from the same set of fallible data. However,
very limited Monte Carlo results (Thissen, 1976a) indicate that the bias is not
too large.
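
The person step of the second stage, together with the standard errors obtained from the inverted matrix of second derivatives, might be sketched as follows. This is a plain Newton-Raphson iteration for one person's θ_i and s_i with the item and overall parameters held fixed; it is not the conditioned algorithm of Haberman (1970), and the function names, starting values, and data are illustrative assumptions.

```python
import math


def derivatives(theta, s, r, log_t, a, c, u, nu, b, sigma):
    """Gradient and second derivatives of one person's log likelihood."""
    g_t = g_s = h_tt = h_ts = h_ss = 0.0
    for r_j, lt_j, a_j, c_j, u_j in zip(r, log_t, a, c, u):
        z = a_j * theta + c_j
        p = 1.0 / (1.0 + math.exp(-z))
        d = (lt_j - (nu + s + u_j - b * z)) / sigma
        k = b * a_j / sigma            # derivative of d with respect to theta
        g_t += a_j * (r_j - p) - d * k
        g_s += d / sigma               # derivative of d with respect to s is -1/sigma
        h_tt -= a_j * a_j * p * (1.0 - p) + k * k
        h_ts += k / sigma
        h_ss -= 1.0 / sigma ** 2
    return (g_t, g_s), (h_tt, h_ts, h_ss)


def estimate_person(r, log_t, a, c, u, nu, b, sigma, tol=1e-3, max_iter=50):
    """Newton-Raphson estimates of (theta, s) and their approximate standard errors."""
    theta, s = 0.0, 0.0
    for _ in range(max_iter):
        (g_t, g_s), (h_tt, h_ts, h_ss) = derivatives(theta, s, r, log_t, a, c, u, nu, b, sigma)
        det = h_tt * h_ss - h_ts * h_ts
        step_t = -(h_ss * g_t - h_ts * g_s) / det   # first row of -H^{-1} g
        step_s = -(h_tt * g_s - h_ts * g_t) / det   # second row of -H^{-1} g
        theta += step_t
        s += step_s
        if max(abs(step_t), abs(step_s)) < tol:
            break
    _, (h_tt, h_ts, h_ss) = derivatives(theta, s, r, log_t, a, c, u, nu, b, sigma)
    det = h_tt * h_ss - h_ts * h_ts
    # Diagonal of the inverse of -H, treated as the inverse information matrix.
    return theta, s, math.sqrt(-h_ss / det), math.sqrt(-h_tt / det)


# Made-up item parameters and data for three items.
a, c, u = [1.0, 1.2, 0.8], [0.5, -0.3, 0.0], [0.1, -0.1, 0.0]
r, log_t = [1, 0, 1], [2.1, 2.4, 1.9]
theta_hat, s_hat, se_theta, se_s = estimate_person(
    r, log_t, a, c, u, nu=2.0, b=0.3, sigma=0.5, tol=1e-10
)
```

As the text notes, standard errors computed this way treat the item parameters as known and are therefore somewhat too small.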

The  Data 

The  data  analyzed  were  the  responses  of  78  University  of  Kansas  undergradu¬ 
ate  students  to  three  tests  of  cognitive  ability.  The  students  were  selected  to 
participate  in  a  study  of  individual  differences  in  cerebral  laterality;  there¬ 
fore,  left-handed  individuals  were  substantially  over-represented  in  the  sample. 
The  students  partially  satisfied  a  research  participation  requirement  in  an  in¬ 
troductory  psychology  course  by  their  participation  in  the  study.  The  students 
were largely college freshmen: 36 were male and 42 were female.

Tests


The  Verbal  Analogies  test  consisted  of  a  27-item  subset  of  the  60  verbal 
analogies  from  the  Minnesota  Multimodal  Analogies  test  described  by  Tinsley 
(1971)  and  Whitely  (1977).  These  analogies  are  quite  difficult  for  college  stu¬ 
dents;  the  raw  scores  ranged  from  4  to  24,  and  the  average  proportion  correct 
was  .57.  No  item  was  answered  correctly  by  all  of  the  students. 

The  Progressive  Matrices  Test  consisted  of  Sets  B,  C  and  D  of  the  Raven 
Colored  (Set  B)  and  Standard  (Sets  C  and  D)  Progressive  Matrices  (1956).  There 
were thus a total of 36 items, of which two (B-1 and B-3) were answered correctly
by more than 98% of the students; those two items were omitted from further
analyses.  The  Progressive  Matrices  Test  was  easier  for  this  group  than  the  Ver¬ 
bal  Analogies  test;  the  average  proportion  correct  was  .72.  The  raw  scores 
ranged  from  12  to  34. 

The  Clocks  spatial  test  consisted  of  19  items  drawn  from  a  set  of  color 
slides made to resemble items on the Guilford-Zimmerman (1953) "Clocks" test of
spatial  visualizing  ability.  The  items  give  pictorial  instructions  about  the 
rotation  in  three  dimensions  of  a  large,  old-fashioned  alarm  clock,  and  the  ex¬ 
aminee  is  required  to  select  from  a  set  of  alternatives  a  view  of  the  clock  as 
it  would  appear  in  the  rotations.  Two  of  the  items  were  too  easy  for  this  group 
and  were  omitted  from  the  analyses;  therefore,  Clocks  was  a  17-item  test,  on 
which  the  raw  scores  ranged  from  5  to  16. 

The  test  items  were  presented  by  a  slide  projector  and  rear  projection 
screen,  which  was  located  immediately  in  front  of  the  examinee.  All  test  items 
were in multiple-choice format, with four to eight alternatives; the students
responded by pressing a numbered response key on a calculator-like keyboard.
The presentation of the slide triggered a light sensor, which started
the  timer;  the  response  of  the  examinee  stopped  the  timer  and  initiated  a  dis¬ 
play  of  the  response  and  latency  for  the  examiner. 

Data  Preparation 


Before  the  iterative  estimation  of  the  parameters  of  the  latent  trait  model 
was  begun,  the  data  were  trimmed  of  extreme  outliers,  following  the  procedure 
suggested  earlier  (Thissen,  1976a).  Using  heuristically  obtained  starting  val¬ 
ues  for  the  parameters  of  the  timed  testing  model,  observations  for  which  the 
observed  latency  deviated  from  the  predicted  latency  of  the  model  by  more  than 
three  standard  deviations  were  removed  from  the  sample.  Since  missing  data  are 
tolerated  by  the  estimation  procedure,  those  latencies  were  not  replaced  in  any 
way.  Most  responses  trimmed  had  very  long  latencies  (more  than  a  minute  in  many 
cases).  Less  than  .5%  of  the  observations  were  trimmed  in  this  way. 

Results 


Verbal  Analogies 


The Verbal Analogies included in this set were unconventional in a variety
of ways: The blank (to be filled in by the examinee's response) could be any one
of the four terms of the analogy; the terms in the analogy could be permuted
(from a:b::c:d to a:c::b:d, and so on); and a number of different types of
relationships could hold between the analogy elements. All of these factors
apparently contributed to complaints, from both the research assistants who
administered the test and the students who took it, that it was a very bad
test. That it was a very difficult test also probably contributed to the
negative feelings the examinees voiced. Nevertheless, the latent trait analysis
with the timed testing model revealed that the response data from this test were
fitted by the model more closely than any other test data to which the model has
been applied. And comparisons of the results of this analysis with previous
analyses of this same pool of items indicate that this was, in fact, a very good
test for measuring analogical reasoning ability, even though the examinees did
not like it.

The goodness of fit of the data to the timed testing model may be measured
in a number of ways. Since the logistic item response model gives a probability
of a correct response at any value of θ, the examinees may be ordered according
to their estimated effective ability level, divided into relatively homogeneous
ability groups, and the proportion of correct responses in each of the ability
groups may be compared to the proportion expected if all of the examinees in
each group had the same ability, namely, the mean for that ability group. This
procedure yields a contingency table (correct/incorrect responses × ability
fractile) for each item and a χ² test of the goodness of fit of the model. In
this analysis the examinees were divided into 6 equal groups and the resulting
4-degree-of-freedom χ² tests were computed; only 1 of the 27 values was
significant at the p = .05 level. The total of the likelihood ratio χ² values was
131.2 with 108 degrees of freedom (.10 > p > .05), which was nonsignificant and
indicates that the probabilities of correct responses are predicted fairly well
by the model.
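
The fractile procedure just described can be sketched in a few lines. The following is only an illustration under stated assumptions, not the program used in the study: the function name `item_fit_g2`, the logit form `a*theta + c`, and the simulated data are all hypothetical, and the likelihood-ratio form of the statistic is one reasonable reading of "χ² test" here. With 6 groups, the 4 degrees of freedom per item are what would be expected if two item parameters are counted as estimated.

```python
import math
import random


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def item_fit_g2(thetas, responses, a, c, n_groups=6):
    """Likelihood-ratio chi-square comparing observed and expected counts
    of correct/incorrect responses across ability fractiles for one item."""
    order = sorted(range(len(thetas)), key=lambda i: thetas[i])
    size = len(order) // n_groups
    g2 = 0.0
    for g in range(n_groups):
        start = g * size
        # The last group absorbs any remainder so that nobody is dropped.
        stop = len(order) if g == n_groups - 1 else start + size
        members = order[start:stop]
        n = len(members)
        mean_theta = sum(thetas[i] for i in members) / n
        p = logistic(a * mean_theta + c)  # expected proportion correct
        n_correct = sum(responses[i] for i in members)
        cells = [(n_correct, n * p), (n - n_correct, n * (1.0 - p))]
        for observed, expected in cells:
            if observed > 0:  # an empty cell contributes nothing
                g2 += 2.0 * observed * math.log(observed / expected)
    return g2


# Simulated data only: 78 examinees answering one item that follows the model.
random.seed(0)
thetas = [random.gauss(0.0, 1.0) for _ in range(78)]
responses = [int(random.random() < logistic(1.0 * t + 0.3)) for t in thetas]
g2 = item_fit_g2(thetas, responses, a=1.0, c=0.3)
print(round(g2, 2))
```

Summing such item statistics over the 27 items would give a total on 27 × 4 = 108 degrees of freedom, which matches the total reported above.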

Once the parameters of the model were estimated, the assumption of normality
of the residuals of log(t) from the model was examined by computing the
actual residuals (2,098 of them, in this case), dividing them into fractiles,
and comparing their distribution to Gaussian expectation. Again, a χ² test of
the goodness of fit was used; in this example, with 10 fractiles, the assumption
of normal error seemed to be met in the data. The result masks a small violation
of the assumptions of the model: Correct responses were slightly, but
significantly, faster than incorrect responses; the difference was .22 standard
deviation units in log time.
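
A check of this kind might be sketched as follows. This is a hedged illustration rather than the original computation: the function `normality_chi2` is hypothetical, the residuals are simulated, and the choice of equal-probability bins and of the likelihood-ratio form of the statistic are assumptions.

```python
import math
import random
from statistics import NormalDist, mean, pstdev


def normality_chi2(residuals, n_bins=10):
    """Bin standardized residuals into fractiles that are equally likely
    under a Gaussian, and compare observed with expected bin counts."""
    m, s = mean(residuals), pstdev(residuals)
    z = [(r - m) / s for r in residuals]
    std_normal = NormalDist()
    # Interior cut points giving each bin probability 1/n_bins under normality.
    cuts = [std_normal.inv_cdf(k / n_bins) for k in range(1, n_bins)]
    counts = [0] * n_bins
    for value in z:
        counts[sum(value > cut for cut in cuts)] += 1
    expected = len(z) / n_bins
    g2 = 2.0 * sum(o * math.log(o / expected) for o in counts if o > 0)
    return counts, g2


# Simulated log-latency residuals; 2,098 matches the number quoted in the text.
random.seed(1)
residuals = [random.gauss(0.0, 0.4) for _ in range(2098)]
counts, g2 = normality_chi2(residuals)
print(counts, round(g2, 2))
```

Running the same function separately on residuals from correct and from incorrect responses would be one way to expose the small speed difference that the pooled test masks.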

The goodness of fit of the model for the latencies may be strikingly portrayed
using data from individual examinees and items. Figure 1 shows a scatterplot
of the observed log response times, t*, and the log response times estimated
from the model, t̂*, for an individual student across the 27 items. The
correlation was .59.

Figure 2 is a similar scatterplot of the log latencies, t*, and the predicted
log latencies, t̂*, for all 78 subjects on one of the analogy items,
"cent:dime::....:dollar." The correlation between the observed and fitted log
latencies for this item was .65.

Figure 1

Scatterplot of the Observed Log Response Times, t*,
and the Fitted Log Response Times, t̂*,
for Student 24 for the Analogies Test


The fit to the latency data comes both from the person and item slowness
parameters, s_i and u_j, and from negative effects due to effective ability and
item easiness, since the logit regression parameter, b, had a value of .20. The
quantitative meaning of the regression of latency on effective ability varies
along the time scale due to the log transformation; but for an average item (in
all respects) a person of average slowness and average effective ability (θ = 0)
should respond in about 12.5 seconds and a less able individual (θ = -1) should
respond in about 15.2 seconds. The timed testing model makes use of the
relationships between latency and the probability of a correct response to provide
relatively precise estimates of all of the parameters involved. Some 35% to 50%
of the variance in log latency within an individual or an item is predicted by
the model.
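
The link between the reported correlations and the share of variance predicted is simply the squared correlation between observed and fitted log latencies. A minimal sketch, with a hypothetical helper `pearson_r` standing in for whatever routine was actually used:

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Correlations reported for Figure 1 (one student) and Figure 2 (one item).
r_student, r_item = 0.59, 0.65
share_student = r_student ** 2  # roughly 0.35
share_item = r_item ** 2        # roughly 0.42
print(round(share_student, 2), round(share_item, 2))
```

The two reported correlations therefore correspond to roughly 35% and 42% of the variance in log latency, inside the 35% to 50% range quoted above.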


Since some of the item parameters in the timed testing model are the same
as those used in conventional latent trait analysis, the timed testing model may
be evaluated as an item calibration scheme by comparing the item parameters
estimated using the timed testing model with item parameters estimated using
large-sample, response-only latent trait techniques. Item easiness was estimated
by Tinsley (1971) for the analogy items used here with the Rasch (1960)
1-parameter logistic model, with data from 641 subjects. In Figure 3 the resulting
item easiness parameters are plotted against c_j/a_j, the corresponding
transformation of the easiness parameter of the 2-parameter logistic model used here. The
correlation between the two sets of easiness parameters was .80 (or .70 when the
outlier in the upper right-hand corner, which was the easiest item in the test,
was removed).


Figure 2

Scatterplot of the Observed Log Response Times, t*,
and the Fitted Log Response Times, t̂*,
for One Analogy Item

Large-sample estimates of 2-parameter logistic slopes were not available
for this group of items. However, slopes for each item estimated using the
1-parameter logistic ability were available from the same large-sample item
calibration study, and in Figure 4 they are plotted against the a_j estimated in the
present study. The correlation was only .50, and only two-thirds of the points
were within two of their own standard errors from the regression line.

The  results  shown  in  Figure  4  do  not  seem  very  good,  until  it  is  recalled 
that  the  large-sample  slope  estimates  were  not  very  precise,  since  they  were 
estimated  using  ability  derived  from  a  1-parameter  logistic  model;  and,  even 
with  the  timed  testing  model,  the  standard  errors  of  slopes  estimated  with  only 
78  subjects  were  liberally  estimated  to  be  about  0.12  for  this  test.  All  things 
considered,  the  relationship  between  the  large-sample  slopes  and  those  estimated 
by  the  timed  testing  model  was  about  as  strong  as  would  be  expected,  given  the 
estimation  procedures  and  sample  sizes  involved. 

The item parameters estimated in this analysis revealed some construct and
face validity when they were related to individual item characteristics. Using
data from 107 persons (who sorted these and other analogy items into categories
based on perceived similarity) and Wiley's (1967) latent partition analysis,
Whitely placed these analogy items in 8 categories defined by the type of
relationship between the elements. The average item parameters for the 8 categories
of analogies are given in Table 1. The easiness and item slowness parameters
did not vary significantly among the categories, but the discrimination
parameters were significantly related to analogy type (F(7,18) = 3.16, p < .05).
Quantitative analogies (e.g., cent:dime::....:dollar), word pattern analogies
(e.g., owl:ant::....:tan), and functional analogies (e.g., tree:man::sap:....)
were strongly related to the analogy-solving ability being estimated.
Class-naming and similarity analogies (e.g., puzzle:....::riddle:ocean), which are
thinly disguised vocabulary items, were not strongly discriminating as
analogies.


Figure 3

Timed Testing Easiness (c_j/a_j)
Plotted Against Large-Sample Easiness Parameters
for the Analogies Items


Within analogy types and within items of the same easiness, the u_j parameters
described other differences among the items. Some 58% of the examinees
responded correctly to the quantitative analogy "cent:dime::....:dollar" in a
median time of 9.4 seconds, with an estimated item slowness of -.13. Nearly the
same number, 59% of the students, responded correctly to another of the
quantitative analogies, "....:yesterday::tomorrow:today"; but since they expended an
average of 15.5 seconds on it, u_j was .28.


Figure 4

Timed Testing Discrimination Parameters (a_j)
Plotted Against Large-Sample Slope Parameters
for the Analogies Items

Is it possible that students could compute money faster than time? Or is a
rearrangement of the second analogy in order to place the stem first the
time-consuming factor? In any event, the consequence is that an examinee
who responds to the first analogy correctly in 10 seconds is average; but an
examinee who responds to the second analogy correctly in 10 seconds is either
very fast or very able. This information will be required by a computerized
adaptive testing system that attempts to use latency as part of a system to
estimate the ability of examinees. And if it can be determined why "yesterdays"
take longer than "cents," the result may contribute to the psychological
understanding of analogy items.


Table 1

Mean Item Parameters for Eight
Categories of Analogies

Analogy Category       a     c/a      u
Quantitative        1.16     .46    .08
Word Pattern         .91     .21    .01
Functional           .90     .02   -.23
Opposites            .80     .73    .14
Conversion           .77     .59    .03
Class Membership     .67    1.31    .12
Class Naming         .53    -.47   -.26
Similarities         .42     .05    .09

Progressive  Matrices 


The goodness of fit of the Progressive Matrices data to the timed testing
model was not as good as for the Verbal Analogies. When the observed
proportions of correct responses for a similar set of six ability groups were compared
to the estimated logistic proportions of correct responses, 4 of the 34
individual item χ² tests were significant at the p < .05 level, and the total of the
χ² values was 201.0 with 136 df (p < .01), which indicates an overall significant
lack of fit of the model. However, the problem seemed to be the test items, not
the timed testing model, as the significant χ² values were all due to items for which
the observed proportions of correct responses did not form a strictly monotonic
increasing function over estimated ability. For most of the items, the fit was
acceptable; so the significant, but small, lack of fit of the item model did not
seem to present insurmountable problems.
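
The significance levels attached to the summed statistics can be checked with standard-library code alone, because for an even number of degrees of freedom the upper tail of the χ² distribution reduces to a finite Poisson sum. The helper below is an illustrative sketch (the name `chi2_upper_tail_even_df` is hypothetical); it applies to the 108-df and 136-df totals but not to the 7-df residual tests, whose df is odd.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom exceeds x), exact for even df.

    Uses the identity P = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = math.exp(-half)  # k = 0 term of the sum
    total = term
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return total


p_analogies = chi2_upper_tail_even_df(131.2, 108)  # Verbal Analogies total
p_matrices = chi2_upper_tail_even_df(201.0, 136)   # Progressive Matrices total
print(round(p_analogies, 3), round(p_matrices, 4))
```

Under this calculation the Verbal Analogies total falls between .05 and .10 and the Progressive Matrices total falls well below .01, in agreement with the levels reported in the text.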

The χ² test of the goodness of fit of the Gaussian distribution to the log
latency residuals was highly significant: The likelihood ratio χ² was 58.5, with
7 df (p < .01). This indicated a substantial violation of the assumption of
lognormal error for the response times. The problem appeared to arise from two
sources. The major problem arose because in this test, as in the Verbal
Analogies, correct responses were faster than incorrect responses (even after all of
the model corrections for ability, easiness, and so on). But the Progressive
Matrices were much easier than the Verbal Analogies, so almost three-fourths of
the responses were correct, and a little too fast. The combined effects of the
75 to 25 mixture of correct and incorrect responses and their slightly different
error distributions made the total distribution of errors around the log latency
model somewhat skewed, and that is what the goodness-of-fit test was detecting.
This slight skewness (magnified in the χ² statistics by the 2,646 residuals in
the distribution) was mostly in the middle of the distribution and should have
little effect on the parameter estimates. There was also a 1% surplus of
latencies that had large positive residuals (over 2 SD above the mean); but none of
those were such outliers that they would exert excessive influence on the
parameters.


Oddly enough, given those results from the goodness-of-fit statistics, the
prediction of the log latencies from the timed testing model was better for the
Matrices than it was for the Verbal Analogies. Figure 5 is a scatterplot of
observed (t*) and fitted (t̂*) log latencies for the 34 Matrices items for the
same student whose Analogies data are shown in Figure 1. The correlation was .84,
indicating that some 70% of the variance of log response time within that
individual was being captured by the model. In part, this is because there was
substantial regression of log latency on effective ability and easiness for the
Matrices data. The b parameter was estimated at .8, and the location parameter
for the log latencies was 3.10; that means that an item of easiness "0" should
take an average examinee 22.2 seconds, and an item of easiness "1" would take
the same examinee only 10 seconds. It also means that the response times were
used very heavily in the estimation of effective ability and item easiness.
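
The arithmetic behind these predicted times can be reproduced under one assumption that this passage does not spell out: that the latency part of the model is log-linear, with log t equal to a location parameter plus person and item slowness minus b times the logit z of the response model. The function `predicted_seconds` is a hypothetical helper written for this illustration, and the Analogies location is not reported in the text, so it is backed out here from the 12.5-second figure.

```python
import math


def predicted_seconds(location, b, z, person_slowness=0.0, item_slowness=0.0):
    """Predicted latency in seconds, assuming log t = location + s + u - b*z."""
    return math.exp(location + person_slowness + item_slowness - b * z)


# Progressive Matrices: location 3.10, b = .8, average examinee (theta = 0).
matrices_easiness_0 = predicted_seconds(3.10, 0.8, 0.0)  # about 22.2 s
matrices_easiness_1 = predicted_seconds(3.10, 0.8, 1.0)  # about 10 s

# Verbal Analogies: b = .20; the location is an assumption derived from 12.5 s,
# and the logit for an "average item" is taken to equal theta.
analogies_location = math.log(12.5)
analogies_theta_minus_1 = predicted_seconds(analogies_location, 0.20, -1.0)

print(round(matrices_easiness_0, 1), round(matrices_easiness_1, 1),
      round(analogies_theta_minus_1, 1))
```

Under these assumptions the Matrices values match the text, and the Analogies value for θ = -1 comes out near 15.3 seconds, close to the 15.2 seconds reported; the small gap is presumably due to rounding of b.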

Figure 5

Scatterplot of the Observed Log Response Times, t*,
and the Fitted Log Response Times, t̂*,
for Student 24 for Progressive Matrices


Some  evidence  that  the  item  easiness  and  slope  parameters  were  estimated 
fairly  well  by  the  timed  testing  model  with  the  aid  of  the  response  times  comes 
from  a  comparison  of  the  item  parameters  obtained  in  this  analysis  with  those  of 
another  large-scale  latent  trait  item  calibration  that  included  a  subset  of 
these  same  Progressive  Matrices  items.  Thissen  (1976b)  estimated  2-parameter 
logistic  item  parameters  for  20  items  drawn  from  the  Progressive  Matrices  Sets 
A,  B,  and  C  administered  to  570  junior  high  school  students.  Eighteen  of  those 
items  formed  a  subset  of  the  34  Matrices  items  included  in  the  present  analysis. 


Figure 6 is a plot of the corrected easiness parameters (c_j/a_j) from the
present data against the same parameter for the same items from the junior high
data; the correlation was .94. Figure 7 is a similar scatterplot for the slopes
(a_j). The correlation in this case was .61, and there was some evidence of
curvilinearity, due mostly to the fact that there were 4 very high slopes in the
earlier junior high calibration with only 20 items. (Frequently, one or a few
of the slopes "climb" in a 2-parameter logistic item calibration of a test with
few items. This did not occur with this relatively large set of 34 Matrices
items.) Nevertheless, these sets of discrimination parameters correlated more
highly than the unmatched slope parameters did for the Verbal Analogies. The
correlation indicates that timed testing can yield reasonably effective item
calibration with fewer than 100 subjects. And it indicates that there are
differences in discrimination among items of this sort that are reliable across
changing sets of items and across tests administered to junior high and
college students.


Figure 6

Timed Testing Easiness Parameters
Plotted Against Large-Sample Easiness Parameters
for Progressive Matrices

Clocks 


With  only  17  usable  items,  the  Clocks  spatial  test  was  the  shortest  of  the 
three  tests  considered  here  and,  probably  as  a  result,  in  some  ways  the  most 
unstable.  Four  of  the  items  had  estimated  discrimination  parameters  greater 
than  2.0,  and  three  other  items  had  slopes  very  near  0,  indicating  that  the  es¬ 
timation  procedure  took  a  short  test  and  made  it  shorter  by  allowing  a  fraction 
of  the  items  to  dominate  the  scoring.  It  seems  that  the  2-parameter  logistic 
model  only  remains  "democratic"  (uses  all  of  the  items  in  scoring)  as  long  as 
there  is  a  sufficiently  large  "silent  majority"  of  test  items  to  prevent  the 
scoring  procedure  from  freely  reordering  the  examinees  until  a  few  items  have 
near-perfect  discrimination  and  the  rest  are  omitted. 
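
One way to see how a few items can come to dominate the scoring: in a 2-parameter logistic model with known item parameters, the responses enter the likelihood for ability only through the slope-weighted sum of item scores, so each item's influence is proportional to its slope. The sketch below uses hypothetical slopes chosen only to mimic the pattern described (four slopes above 2.0, three near 0, and ten assumed moderate values); they are not the estimates from this study.

```python
def scoring_weight_shares(slopes):
    """Share of the slope-weighted score carried by each item."""
    total = sum(slopes)
    return [a / total for a in slopes]


# Hypothetical 17-item test: 4 very steep items, 3 nearly flat, 10 moderate.
slopes = [2.2] * 4 + [0.05] * 3 + [0.8] * 10
shares = scoring_weight_shares(slopes)
ranked = sorted(shares, reverse=True)
top_four = sum(ranked[:4])       # share held by the four steepest items
bottom_three = sum(ranked[-3:])  # share held by the three flattest items
print(round(top_four, 2), round(bottom_three, 3))
```

With these illustrative values the four steepest items carry just over half of the total weight, while the three flattest items together carry less than one percent.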

To some extent, this has happened here; but the estimation procedure stopped
short of infinite discriminations. The high discriminations did result in
some outliers among the estimated values for θ; four of the 78 estimates were
between 2.4 and 4.4. However, the high-scoring students did respond correctly
to all of the items but one with near-zero discrimination, and frequently
responded quite quickly, indicating that they were indeed individuals of very high
spatial ability. So the seemingly extreme values may not represent serious
difficulties.


Figure 7

Timed Testing Discrimination Parameters (a_j)
Plotted Against Large-Sample Slopes
for Progressive Matrices

As measured by the χ² goodness-of-fit statistics on the item
correct-response × ability group contingency tables, there were certainly no difficulties.
None of the item χ² values were significant, and the total of 90.9 with 68 df was not
highly significant, either (p = .05). The probability of correct response as a
function of effective ability seemed to be approximated fairly well by the timed
testing model with its estimated parameters.

The distribution of residuals from the log latency model showed approximately
the same degree of non-normality as was the case for the Progressive
Matrices; the χ² was 39.4 with 7 df (p < .01), which was highly significant. The
pattern of non-normality was the same as it was for the Matrices data; most of
the problem was caused by the combined effects of three-fourths of the responses
being correct and the correct responses being faster. Again, the non-normality
was not due to extreme outliers; so the parameter estimation, although probably
not quite optimal, should not be seriously affected. The estimated value for
the regression parameter for the Clocks spatial test was only .025 (estimated
SE = .003); while that implies that there was a significant relationship between
response time and the probability of a correct response on the items of this
test, the relationship was much smaller than for the other two tests. It seems
that there were individual differences in the ability to respond correctly to
these items and individual differences in speed of response; but, in striking
contrast to the Matrices test, those two variables were unrelated. Indeed, the
correlation between θ and s for the Clocks was -.03. (Estimated correlations
among all of the individual parameters are given in Table 2.)

Given the low correlation between θ and s, it would seem that only one of
those variables could be the classical spatial trait. In this case, that was θ,
on which there was a significant sex difference of the same magnitude and form
usually found on speeded spatial tests (F(1,76) = 4.65, p < .05). Even allowing
for some moderate non-normality, the males scored substantially higher. There
was no hint of a sex difference on s with the Clocks. For this test, with
three-dimensional rotations, it seems that the classical spatial trait simply
determined whether the problem could be solved or not; using more or less time
seemed to make little or no difference. This is in marked contrast to the
results obtained earlier (Thissen, 1976a) with another (simpler, two-dimensional)
spatial test in which both males and females had the same mean θ and in which
the sex difference (and presumably the spatial trait) was in slowness, controlling
for effective ability. The Clocks may be a good paper-and-pencil spatial
test because performance on these items is essentially unrelated to slowness or
carefulness.

If effective ability and easiness are the spatial trait, what are the slowness
parameters doing? To answer that question, the items, whose properties are
fairly easy to define, may be examined. The easiness of an item seems inversely
related to the amount of rotation it requires, a result reminiscent of the
Shepard and Metzler (1971) finding. Holding amount of rotation constant, the
item slowness parameter is, in part, an increasing function of the number of
symbolic "instructions" the item uses to achieve that rotation. The items in
the test define the rotation to be applied by arrows on the surfaces of 1, 2, 3,
or 4 spheres. A rotation of 180 degrees may be defined by any of those numbers
of spheres; and in this subset of items, it was. The more spheres the item used
to give the instruction, the longer it took to encode (presumably) and the
longer it took for the examinee to respond. Regardless of the number of spheres
involved, however, the 180 degree items had about equal easiness. Again, a
computerized adaptive testing system using latency to estimate ability would have
to recognize that for the Clocks spatial test, more "instruction spheres" do not
necessarily make the item harder (that depends on the amount of rotation), but
they do make the response slower.

Relationships  Among  the  Tests 

Correlations among parameters. The Progressive Matrices have been variously
labeled as a test of abstract reasoning ability, as g, and as a number of
other constructs. There is evidence in the present data (see Table 2) that when
given no time limit, the Matrices become a test of slowness. Effective ability
and slowness on the Matrices were correlated .94 in these data; the more slowly
a subject responded, the more likely he/she was to get the item correct, and
vice versa. Effective ability and slowness were not nearly so closely related
for the Verbal Analogies (r = .67), and they were unrelated on the Clocks. One
possible explanation is that some kinds of test items are affected more by
slowness (and, possibly, carefulness) on the part of the examinees than are others.
This speculation has some support in these data in the form of the
relationships, shown in Table 2, between effective ability and slowness on the Verbal
Analogies and the parameters of the same names on the Matrices.


Table 2

Estimated Correlations Among the Person Parameters
for the Three Tests

Test and              Analogies       Matrices        Clocks
Parameter              θ      s        θ      s       θ      s

Analogies
  θ                  1.00
  s                   .68   1.00
Matrices
  θ                   .54    .69     1.00
  s                   .39    .68      .94   1.00
Clocks
  θ                   .42    .28      .39    .29    1.00
  s                  -.09    .40      .36    .55    -.03   1.00


Both θ and s for the Progressive Matrices may be predicted quite well from
effective  ability  and  slowness  on  the  Verbal  Analogies;  the  multiple  R's  were 
.70  and  .69,  respectively.  The  standardized  regression  coefficients  are  given 
in  Table  3.  Effective  ability  on  the  Progressive  Matrices  was  predicted  by  a 
small  positive  weight  for  effective  ability  on  the  Analogies  and  a  larger  posi¬ 
tive  weight  for  slowness  on  the  Analogies.  Individuals  who  answered  items  slow¬ 
ly  on  the  Analogies  responded  correctly  on  the  Matrices.  Slowness  on  the  Matri¬ 
ces  was  similarly  predicted  by  slowness  on  the  Analogies,  with  a  small  negative 
weight  for  Analogies  effective  ability:  individuals  who  answered  items  entirely 
too  slowly  on  the  Analogies,  given  their  ability  to  respond  correctly,  answered 
very  slowly  on  the  Matrices. 
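
The regression weights and multiple correlations quoted here follow directly from the correlations in Table 2 by the textbook two-predictor formulas. The check below is an illustrative sketch (the function name `two_predictor_betas` is hypothetical); the input correlations are those read from Table 2, with .68 used for the correlation between the two Analogies parameters.

```python
import math


def two_predictor_betas(r_y1, r_y2, r_12):
    """Standardized weights and multiple R for a criterion y regressed on
    two predictors, given the three pairwise correlations."""
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    multiple_r = math.sqrt(beta1 * r_y1 + beta2 * r_y2)
    return beta1, beta2, multiple_r


R_ANALOGIES = 0.68  # Analogies theta with Analogies s (Table 2)

# Predictor 1 is Analogies theta, predictor 2 is Analogies s.
theta_b1, theta_b2, theta_R = two_predictor_betas(0.54, 0.69, R_ANALOGIES)
s_b1, s_b2, s_R = two_predictor_betas(0.39, 0.68, R_ANALOGIES)

print(round(theta_b1, 2), round(theta_b2, 2), round(theta_R, 2))
print(round(s_b1, 2), round(s_b2, 2), round(s_R, 2))
```

The results agree, to two decimals, with the multiple R values of .70 and .69 given in the text and with the weights listed in Table 3.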

It  could  be  concluded  that  the  three  tests  in  this  particular  set  represent 
three  types  of  timed  tests.  On  the  Progressive  Matrices,  working  slowly  and 
carefully  was  strongly  related  to  the  probability  of  responding  correctly,  and 
what  is  measured  is  largely  slowness.  On  the  Verbal  Analogies,  working  slowly 
and  carefully  was  related  to  responding  correctly,  but  not  so  strongly;  so  two 
distinct  traits  are  measured — analogical  reasoning  ability  and  slowness.  On  the 
Clocks  slowness  did  not  seem  to  affect  the  probability  of  a  correct  response; 
therefore,  spatial  ability  and  slowness  are  essentially  measured  separately. 

Table 3

Standardized Regression Coefficients for
Prediction of the Matrices Scores from
the Analogies Scores

Independent Variable    Matrices θ    Matrices s
Analogies θ                 .13          -.13
Analogies s                 .60           .77


Information. The differences among the three types of tests are graphically
portrayed in the test information functions for θ for the three tests, shown
in Figure 8. The latency data had almost no effect on the ability estimation in
the Clocks; as a result, the test information function for the spatial test has
the classical "peaked" form of 2-parameter logistic test information functions
for tests not constructed to spread the items very well. All of the information
about ability comes from the item responses, and that information is only
substantial near the difficulty level of the items, in this case between θ = -1 and
0.


Figure  8 

Test  Information  Functions  for  the  Three  Tests 


The Progressive Matrices, on the other hand, have a flat, regression-like
test information curve. When b is estimated to be quite high, as it was in the
case of the Matrices data in this example, effective ability is essentially
estimated as though it were an element in a linear regression equation
predicting log latency. The log latencies are assumed to provide the same amount of
information about effective ability regardless of the value of θ; therefore, the
test information curve is nearly flat.

The Verbal Analogies test information function shows a situation in which
information was being obtained both from the item response data (giving the
curve its familiar peaked form) and the latencies. The information provided by
the response times serves to raise the entire curve, so that there is some
information available in the system for estimating the effective ability of
individuals with relatively extreme trait values. The test information function for
the Verbal Analogies represents the desired outcome for the timed testing model;
the other examples make it clear that we are only beginning to learn what can
happen in a timed testing situation.
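
The three shapes described above can be illustrated with a simple information calculation. This is a sketch under assumptions, not the computation behind Figure 8: it supposes that the log latency is normally distributed with mean linear in the logit (slope -b) and residual standard deviation sigma, treats all other parameters as known, and uses hypothetical item parameters. Under those assumptions each item contributes the usual logistic term a²P(1-P) from the response plus a constant (b·a/sigma)² from the latency.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def theta_information(theta, items, b=0.0, sigma=1.0):
    """Information about theta from responses and log latencies combined.

    items is a list of (a, c) pairs with logit a*theta + c.
    """
    info = 0.0
    for a, c in items:
        p = logistic(a * theta + c)
        info += a * a * p * (1.0 - p)  # response part, peaked in theta
        info += (b * a / sigma) ** 2   # latency part, constant in theta
    return info


# Hypothetical item parameters, not estimates from this study.
items = [(1.0, 0.5), (0.9, 0.0), (1.1, -0.5), (0.8, 0.2)]
for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    responses_only = theta_information(theta, items)
    with_latency = theta_information(theta, items, b=0.5, sigma=0.5)
    print(theta, round(responses_only, 3), round(with_latency, 3))
```

With b near zero the curve is the familiar peaked one; as b grows relative to sigma the constant term dominates and the curve flattens; intermediate values raise the whole peaked curve by a fixed amount.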

Conclusions 


The  latent  trait  model  for  timed  ability  testing  described  here  is  neither 
perfect  nor  complete;  it  still  requires  extensive  (and  expensive)  computation  to 
estimate  the  parameters  of  the  model  for  fairly  small  samples,  and  it  needs  an 
additional  parameter  to  absorb  the  relatively  consistent  difference  between  the 
residuals from the log latency model for correct and incorrect responses.
Improved starting values for the maximum likelihood algorithm would go a long way
toward solving the first problem. And the existence of the second problem is
interesting. It is not too surprising that, averaging over people or items,
correct answers take less time than incorrect ones, because correct answers come
from  able  people  answering  easy  items  and  incorrect  responses  come  from  less 
able  people  responding  to  harder  items.  But  this  timed  testing  model  corrects 
for  the  ability  of  the  individuals  and  the  easiness  of  the  items,  and  it  is 
still  true  that  correct  answers  are  associated  with  shorter  latencies.  This 
could  be  a  real  indication  of  some  processing  difference  between  correct  and 
incorrect  responses;  future  research  could  define  the  difference. 

However,  even  with  these  potential  problems,  the  timed  testing  model  is 
ready  for  use.  These  data,  as  a  matter  of  fact,  were  not  collected  to  test  the 
timed testing model; they were collected in an investigation of cerebral
laterality and its relationship to cognitive abilities. The timed testing model was
used  to  score  the  test  because  it  used  the  available  data  most  efficiently.  As 
computers  do  the  testing  and  time  the  responses,  that  will  probably  be  the  case 
with  increasing  frequency. 

Discussion: Session 6


John  B.  Carroll 
University  of  North  Carolina 
at  Chapel  Hill 


Both the Thissen and the Tatsuoka papers are excellent, presenting
interesting and useful approaches to a problem with which I have long been
concerned: the role of speed in testing ability and achievement. I can best
comment on them by using them as stimuli for some general remarks about
speed-ability relationships.

First,  however,  I  will  consider  one  technical  issue  that  is  touched  upon  in 
both  papers — namely,  the  distribution  of  item  response  times  over  items  and/or 
over  individuals.  Thissen  has  stated  that  "data  analytic  considerations  suggest 
that  response  times  are  frequently  approximately  lognormally  distributed,"  and 
his  analyses  consequently  utilize  logarithms  of  these  times.  Tatsuoka,  on  the 
other hand, has provided evidence that response times follow a Weibull
distribution and has offered a tentative rationale for the appropriateness of such a
distribution.

Before  either  the  lognormal  or  the  Weibull  distribution  for  response  times 
is  accepted,  consideration  should  be  given  to  the  basic  metric  for  these  times. 
It has appeared more reasonable to express response latency in terms of
performance per unit of time (e.g., Landahl, 1940; Wainer, 1977). This is the
conventional way of measuring, for example, the speed of a vehicle (miles or
kilometers per hour), and it can also be applied to rate of work in performing test
items. This metric has a theoretical zero point expressing a state of no
motion, in the case of a vehicle, or a state of no performance at all, in the case
of  work  on  a  test.  When  rate  of  performance  is  expressed  in  this  way,  I  have 
generally  found  that  individual  differences  tend  to  be  normally  distributed. 

For  this  reason,  it  has  been  my  practice  to  take  the  reciprocals  of  response 
latencies  and  to  use  these  in  my  data  analyses,  rather  than  the  raw  response 
times or even their logarithms. This has been done, for example, for
picture-naming latencies (Carroll & White, 1973a, 1973b), interpreting the reciprocals
as number of pictures that could be named per unit of time. In reporting central
tendencies for response times, I use the harmonic mean, which is, of
course,  a  back-translation  from  the  arithmetic  mean  of  the  reciprocals  of  the 
response  times. 
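
The relation between the harmonic mean and the mean of the reciprocals can be checked numerically. The following is a minimal sketch in Python; the latency values and the function names are invented for illustration and are not taken from the picture-naming data.

```python
from statistics import harmonic_mean


def mean_rate(latencies):
    """Arithmetic mean of the reciprocal latencies (responses per unit time)."""
    return sum(1.0 / t for t in latencies) / len(latencies)


def back_translated_latency(latencies):
    """Translate the mean rate back to the latency scale."""
    return 1.0 / mean_rate(latencies)


# Hypothetical response latencies in seconds.
latencies = [0.5, 1.0, 2.0, 4.0]

print(mean_rate(latencies))                # mean rate, responses per second
print(back_translated_latency(latencies))  # same value as the harmonic mean
print(harmonic_mean(latencies))
```

Because the harmonic mean never exceeds the arithmetic mean, this summary gives less weight to occasional very long latencies than the raw mean of the response times would.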

From these considerations, I would suggest that Thissen (or anyone
following his lead) might try to substitute the reciprocal transformation for the
logarithmic transformation of response time. Possibly this would yield better fits
to data, and in any case it would have a somewhat better theoretical
underpinning.


On  the  other  hand,  I  am  much  attracted  to  the  possibilities  that  seem  to  be 
offered  by  the  Weibull  distribution  as  investigated  by  Tatsuoka.  She  has  gone 
far  in  offering  a  reasonable  rationale  for  the  application  of  this  distribution 
to  time  score  data  in  suggesting  that  the  test  item  is  seen  as  a  "system"  that 
the individual tries to break down. Somehow, I have always thought of the
matter in reverse; that is, the examinee is the system that the test item is trying
to break down. The Rasch model, incidentally, presents the idea that whichever
will break down first (the individual or the test item) is given by the relation
between the person parameter and the test item parameter. In dealing with
response times, however, the assumptions of the Weibull distribution make it
reasonable to consider the relation unidirectional, in the sense that one "waits"
for the examinee to "break" the test item. The parameter c in Weibull
distribution theory appears to be of particular interest, for it specifies whether the
individual's probability of mastering the test item increases, remains constant,
or decreases with increasing time. As Tatsuoka noted, "it is intuitively
plausible that items of all three kinds may exist in practice, depending on the
difficulty and other properties of the item." Even mixed cases are possible, for
example,  one  in  which  the  probability  first  increases,  then  decreases,  or  one  in 
which  the  probability  has  a  constant  low  value  and  then  increases  rapidly,  as 
for  a  "sudden  insight"  problem  solution.  Possibly  the  Weibull  distribution 
could  be  applied  to  such  cases  by  assuming  a  two-stage  process,  with  different 
parameters for each stage. Probably, however, the estimation of separate
parameters for the stages would be a formidable problem.
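
The three cases governed by c can be illustrated with the hazard rate of a Weibull distribution. The sketch below assumes the standard two-parameter form, h(t) = (c/scale)(t/scale)^(c-1); Tatsuoka's exact parameterization is not reproduced in this discussion, and reading "probability of mastering the item with increasing time" as the hazard rate is an interpretation. The function names and the time points are hypothetical.

```python
def weibull_hazard(t, c, scale=1.0):
    """Instantaneous completion rate h(t) for a Weibull with shape c (t > 0)."""
    return (c / scale) * (t / scale) ** (c - 1.0)


def trend(c, times=(0.5, 1.0, 2.0)):
    """Classify the hazard as increasing, constant, or decreasing over `times`."""
    rates = [weibull_hazard(t, c) for t in times]
    pairs = list(zip(rates, rates[1:]))
    if all(later > earlier for earlier, later in pairs):
        return "increasing"
    if all(later < earlier for earlier, later in pairs):
        return "decreasing"
    return "constant"


for c in (0.5, 1.0, 2.0):
    print(c, trend(c))
```

Under this form, c greater than 1 gives a rate that rises with time spent on the item, c equal to 1 gives a constant rate (the exponential case), and c less than 1 gives a rate that falls.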

The theme of speed-ability relationships is an old, but relatively
neglected, problem in psychometrics. It is not difficult to find investigations
of it in the early literature (e.g., Dubois, 1932; McFarland, 1930). Thurstone
(1937) formulated a psychometric model of ability, motivation, and speed
involving a three-dimensional psychometric surface in which these variables could vary
independently. Baxter (1941) pointed out that time-limit and work-limit scores
have an artifactual part-whole relation that accounts for their high
intercorrelation in most circumstances and discovered that sheer rate of work or
performance on the Otis group intelligence test had a correlation of essentially zero
with level of ability as determined from work-limit scores on the test.

Davidson and Carroll (1945) pursued this matter further and confirmed these
relations in the case of subtests of the Army Alpha. It was shown that several
different speed and level factors could be identified in the subtests of this
battery when they were administered in such a way as to obtain not only the
conventional time-limit scores but also scores measuring rate of work and level of
ability.  The  time-limit  scores  were  shown  to  have  factor  loadings  on  both  speed 
and  level  factors. 

Since  the  Davidson  and  Carroll  study,  this  line  of  investigation  has  been 
pursued  only  very  infrequently  in  the  factor  analytic  literature.  One  exception 
is  the  study  by  Lord  (1956),  disclosing  separate  dimensions  of  ability  and  speed 
in  a  number  of  domains  of  cognitive  performance.  One  of  Lord's  findings,  for 
example,  was  that  ability  level  in  spatial  tests  is  a  dimension  separate  from 
speed  in  performing  those  tests.  Recently,  Egan  (1976),  working  at  the  item 
response  level,  found  a  similar  two-dimensional  structure  for  spatial  ability 
tests. 

Baxter had proposed that speed, power, and level scores for ability tests
should be carefully distinguished; he defined a speed score as a sheer
rate-of-work score without regard to the correctness of response. Conventional
time-limit scores were to be designated as power scores; work-limit scores (number
correct in unlimited time) were to be regarded as level of ability scores.
Unfortunately, Baxter's proposals were never generally accepted in the
psychometric literature; most often, time-limit scores are called speed scores and
work-limit  scores  are  called  power  scores.  The  more  unfortunate  error,  however, 
is  to  continue  to  assume  that  conventional  time-limit  scores  are  appropriate 
measures  of  level  of  ability,  when  actually  they  usually  reflect  rate  of  work  to 
a  degree  that  is  a  function  of  the  speededness  of  the  test,  or  more  properly, 
the  time  limit. 

The  confusion  between  ability  and  speed  has  arisen  primarily  because  of  the 
conventional  methods  of  administering  tests  with  time  limits  that  are  often  set 
rather arbitrarily but, in any case, as short as is deemed feasible.
Computerized testing, whether it is adaptive or not, offers a means of avoiding the
problem of speeded testing because it can control the test items offered to the
examinee and can measure the time taken to respond to them. What is most
intriguing about Thissen's paper is its proposed methodology for extracting both
ability and speed information from computer-administered tests. This
methodology seems to be highly promising. I would be particularly interested if it
could be developed to permit multidimensional results in either ability or speed
domains or in both.

Although  Thissen's  remarks  about  the  interpretation  of  the  ability  and 
speed  parameters  of  the  three  tests  that  he  analyzed  were  of  interest,  his  data 
treatment might well have included a factor analysis of his table of
intercorrelations. I have taken the liberty of performing such a factor analysis, even
though the iteration for communalities had to stop at 10 iterations to avoid
having at least one of the communalities exceed unity. A Varimax-rotated
solution of the common factor matrix arrived at after 10 iterations is shown in
Table 1. Obviously, the two uncorrelated factors, together accounting for about
77% of the total variance, may be interpreted as ability and speed,
respectively. What is of particular note is that the factors generalize over different
tasks. The best "pure" measure of ability is the parameter θ for the Analogies
test, while the purest measure of speed is the s ("slowness") parameter for the
Clocks test. Nevertheless, the factors show up in interesting ways on other
tasks.

Especially noteworthy were the results for the Raven Progressive Matrices
Test. There is probably more confusion and conflicting evidence in the
literature on the Raven test than on any other commonly used test. Many authors
(e.g., Jensen, 1978) regard the Raven test as one of the best measures of g, or
general intelligence. Thissen's results, however, indicate that the test is
factorially complex, measuring ability and speed in both the θ and s parameters.
Thissen speculated that when the Raven test is given under no time pressure, it
is primarily a measure of speed or its opposite, slowness, or perhaps
carefulness.

Table 1

Varimax Rotated Principal Factor Analysis
of Thissen's Table 2 of Correlations among
Estimated Person Parameters for Three Tests


                             Factor Loadings
                         Factor I     Factor II
Test         Parameter   (Ability)    (Speed)

Analogies    θ              .97          .01
             s              .62          .52
Matrices     θ              .59          .71
             s              .41          .91
Clocks       θ              .45          .09
             s             -.09          .67

Note. The preliminary principal component analysis of the correlation matrix
with unities in the diagonal yielded eigenvalues of 3.35, 1.30, 0.70, 0.45,
0.16, and 0.03. Iteration to two factors for communalities stopped at
iteration 10, before one of the variables would have attained a communality
greater than 1.0. This matrix is presented as sufficient to exhibit the
overall pattern of the results.
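
The figures in the table and its note can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes that the "about 77%" quoted in the text comes from the first two principal component eigenvalues divided by the number of variables, which is an inference from the numbers rather than something stated in the note. Communalities are computed as the sum of squared loadings in each row.

```python
# Eigenvalues and rotated loadings as reported in Table 1 and its note.
eigenvalues = [3.35, 1.30, 0.70, 0.45, 0.16, 0.03]
loadings = {
    ("Analogies", "theta"): (0.97, 0.01),
    ("Analogies", "s"): (0.62, 0.52),
    ("Matrices", "theta"): (0.59, 0.71),
    ("Matrices", "s"): (0.41, 0.91),
    ("Clocks", "theta"): (0.45, 0.09),
    ("Clocks", "s"): (-0.09, 0.67),
}

n_variables = len(loadings)

# Share of total variance carried by the first two principal components.
component_share = sum(eigenvalues[:2]) / n_variables

# Communality of each variable: sum of squared loadings on the two factors.
communalities = {
    key: ability ** 2 + speed ** 2 for key, (ability, speed) in loadings.items()
}
common_factor_share = sum(communalities.values()) / n_variables
largest = max(communalities, key=communalities.get)

print(round(component_share, 3))
print(round(common_factor_share, 3))
print(largest, round(communalities[largest], 4))
```

On these numbers the first two components carry 77.5% of the total variance, while the rotated common factors account for somewhat less. The communality for the Matrices s parameter is already very close to 1.0, which is consistent with the note's remark that iteration had to be stopped.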

Horn  (1978)  made  an  extensive  factor-analytic  study  of  speed,  power,  and 
carefulness  (among  other  things),  supported  by  the  Army  Research  Institute.  Two 
of  his  conclusions  are  the  following: 

"1. There is considerable cohesion among indicants of average speediness
in providing response (either correct or incorrect) in tasks of
nontrivial difficulty.

"2. Intellectual speediness indicated in this manner has only a very low,
perhaps only chance, relationship to the goodness of intellectual
performance that is indicated by the number of correct answers provided
in a wide range of putative measures of intelligence."

Horn's  study  has  elaborated  the  concept  of  speediness  to  a  much  greater  extent 
than can be reviewed here; he also considered the role of strategies that
examinees may adopt in attacking items and the dependence of these strategies on the
character and content of the items or tasks. Actually, his evidence suggests
the existence of two speediness factors, CDS (correct decision speed) and QDS
(quit decision speed), the latter pertaining to situations in which the
individual decides to give up in a problem-solving task.

This result gives me added confidence in suggesting that the methods
developed by Tatsuoka and by Thissen might be applied in further study of ability,
speed, and carefulness factors at the item level. Using a mixed model Weibull
distribution with different c parameters, the CDS and the QDS factors might be
more reliably differentiated. One characteristic of items that might be
relevant here is whether the items are "self-revelatory," that is, whether they have
a  solution  that  can  be  recognized  as  correct  by  examinees  once  they  discover  it, 
in  contrast  to  the  usual  test  item  for  which  the  correct  answer  is  not  obvious 
to  examinees  unless  they  have  the  required  level  of  ability  or  knowledge.  It 
would  be  profitable  to  apply  Thissen's  methods  to  a  wide  range  of  ability  tests 
in  order  to  determine  the  dimensions  of  ability  and  speed  in  such  tests.  I 
would expect the structure to be much more multidimensional than Thissen's
preliminary results show.

Most applications of latent trait theory thus far are limited to the
unidimensional case, or at least to cases in which unidimensionality is assumed.
There is abundant evidence that abilities in the cognitive domain are
multidimensional. It is my hope that work in latent trait theory in the future can
address the multidimensional case more fully.


Using the Rasch Model to Identify Person-Based
Measurement  Disturbances 


Ronald  J.  Mead 
Minneapolis 


There is currently in psychometrics a controversy due to a fundamental
difference of opinion about what measurement is and how it is to be used. George
Bernard Shaw (1903) addressed an important aspect of this controversy in "Maxims
for Revolutionists" when he wrote, "The reasonable man tries to adapt himself to
the world; the unreasonable man persists in trying to adapt the world to
himself ...."

In  psychometrics  the  "reasonable"  approach  involves  constructing  a  model 
sufficiently  complex  to  explain  any  data  that  might  be  produced  by  a  group  of 
people taking a set of test items. The "unreasonable" approach fixes on a
particular model and struggles to make data conform to it, the choice of the model
being  determined  by  philosophical  rather  than  empirical  considerations.  To 
avoid  any  confusion  about  where  Shaw  stood  with  respect  to  reasonableness,  he 
went  on  to  say,  "...  therefore,  all  progress  depends  on  the  unreasonable  man." 

The  simple  logistic  model  can  be  viewed  from  either  position.  For  the 
"reasonable"  psychometrician,  it  is  known  as  the  1-parameter  model,  which  is  a 
special  case  of  the  2-  (or  more)  parameter  model;  and  it  should  be  considered 
when  it  fits  the  data,  if  for  no  other  reasons  than  economy  and  parsimony.  This 
viewpoint  has  a  respectable  ancestry  in  model  building  (with  linear  models), 
where  the  magnitude  of  the  unexplained  error  is  the  criterion  for  deciding  if 
parameters  should  be  added  or  deleted. 

For  the  "unreasonable"  psychometrician,  the  simple  logistic  model  is  known 
as the Rasch model. It is the very definition of measurement; hence,
measurement is not possible when data do not fit it. When viewed from this aspect, it
is  a  very  special  model  indeed,  but  is  not  a  special  case  of  anything. 

This  difference  in  philosophy  might  be  compared  to  the  difference  between 
stepwise  multiple  regression  and  experimental  design,  both  of  which  have  been 
very  useful  to  the  social  sciences.  The  first  is  concerned  with  fitting  data 
using  whatever  model  is  necessary;  the  second,  with  organizing  the  situation  to 
obtain data that conform to a model which will make the estimation and the
inferences as easy and as unambiguous as possible.


In  analysis  of  variance  terms,  the  Rasch  model  is  a  main  class  model  with 
two fixed classes and no interactions. All the data needed to estimate the
effects are contained in the marginal sums; and since all the marginal information
is needed in estimating the main class effects, any additional parameters,
regardless of how they are subscripted, must represent interactions between the
person  and  the  items.  This  raises  the  old  analysis-of-variance  dilemma  of  how 
to  interpret  main  effects  in  the  presence  of  an  interaction. 

Returning to psychometrics, if the model contains any additional
parameters, such as item discrimination, person sensitivity, random guessing by some
persons,  or  nonzero  asymptotes  for  some  item  characteristic  curves,  then  either 
the  item  characteristic  curves  (ICCs)  or  the  person  characteristic  curves  (PCCs) 
or  both  will  not  be  parallel  (after  linearization).  Hence,  the  comparison  of 
the  abilities  of  two  persons  will  depend  on  the  particular  items  used  as  the 
basis  for  the  comparison.  This  is  a  statement  of  what  is  meant  by  failing  to 
achieve  "specific  objectivity"  as  defined  by  Rasch  (1960),  but  it  is  also  a 
statement  of  an  interaction  as  used  in  analysis  of  variance. 
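
The point about parallel curves can be illustrated numerically. The sketch below assumes the usual logistic forms, with log-odds of success θ - b under the Rasch model and a(θ - b) when an item discrimination a is added; the item values, person values, and function names are hypothetical.

```python
def rasch_logit(theta, b):
    """Log-odds of success under the Rasch model (the linearized ICC)."""
    return theta - b


def two_parameter_logit(theta, b, a):
    """Log-odds of success when an item discrimination a is added."""
    return a * (theta - b)


def person_contrasts(logit, theta_1, theta_2, items):
    """Difference in log-odds between two persons, item by item."""
    return [logit(theta_1, *item) - logit(theta_2, *item) for item in items]


# Hypothetical items: (difficulty,) and (difficulty, discrimination).
rasch_items = [(-1.0,), (0.0,), (2.0,)]
two_parameter_items = [(-1.0, 0.5), (0.0, 1.0), (2.0, 2.0)]

rasch_result = person_contrasts(rasch_logit, 1.0, 0.0, rasch_items)
two_parameter_result = person_contrasts(
    two_parameter_logit, 1.0, 0.0, two_parameter_items
)

print(rasch_result)          # the same contrast on every item
print(two_parameter_result)  # the contrast changes with the item used
```

Under the Rasch form the comparison of the two persons is the same whichever item is used; with unequal discriminations the comparison depends on the item, which is the person-by-item interaction described above.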

As  with  interactions,  these  additional  parameters  have  meaning  only  when  at 
least one of the ways of classification (e.g., persons) can be considered a
random sampling from some relevant population. Inferences about ability are
therefore normative in the sense that they pertain only to comparisons within that
population. 

The  most  fundamental  distinction  between  the  Rasch  model  and  the  other 
psychometric  models  is  that  the  Rasch  model  concentrates  on  the  person,  whereas 
the other approaches deal with groups of people. Rasch (1960) quoted two
psychologists on this topic. Citing Skinner (1956), he stated, "Any order to be
found in human and animal behavior should be extracted from investigations into
individuals, and (current) psychometric methods are inadequate for such
purposes, since they deal with groups of individuals." He quoted Zurbin (1956) as
saying, "Recourse must be had to individual statistics, treating each patient as
a separate universe. Unfortunately, present-day statistical methods are
entirely group-centered, so that there is a real need for developing
individual-centered statistics." This is no less important in education. When the intent is
to  describe  the  progress  or  achievement  of  one  student,  it  should  not  matter  to 
what  populations  he/she  is  assigned. 

In solving this problem, Rasch formulated some general principles of
comparison, which can be rephrased as follows:

1.  For  any  relevant  item,  a  more  able  person  always  has  a  better  chance  of 
success  than  a  less  able  person,  and 

2.  Any  person  has  a  better  chance  of  success  on  an  easier  item  than  on  a 
more  difficult  one. 

These  statements  indicate  nothing  about  the  age,  sex,  race,  or  religion  of  the 
person,  only  that  the  items  be  appropriate  to  him/her  for  the  variable. 

Although  these  conditions  may  seem  so  obvious  and  necessary  as  to  be  almost 
trivial,  the  family  of  models  proposed  by  Rasch  are  the  only  ones  that  meet 
them. All other models lack these properties, which Rasch has called "specific
objectivity": objective because the comparison of any two people is independent
of the items used, and specific because the items must be appropriate for making
one particular comparison.

There is, as Rasch (1960) and Fischer (1976) have shown, a large family of
exponential models that have specific objectivity with a variety of types of
observations. The only one to be discussed in this paper is the model for
dichotomously scored items, which looks like the Birnbaum model and, as mentioned
previously, is commonly known as the Rasch model. Because it does have the
potential of achieving objectivity, however, it is not a special case of the
Birnbaum model, and choosing between them should be on the basis of whether or not
objectivity is of value.


Person-Based  Disturbances 


There are two major concerns motivating an interest in the analysis of
person fit. First, the simplicity of the Rasch model makes it very demanding on
the  data.  Although  some  very  desirable  measurement  properties  are  associated 
with  the  model,  they  are  attained  only  if  the  data  fit.  A  thorough  search  for 
misfit  is  therefore  essential. 

Second, since measurement of the person is the objective, it is only
prudent that before any decisions are made about the person (based on a series of
simple  responses  to  artificial  situations),  it  be  verified  that  the  responses 
mean  what  is  intended.  Regardless  of  how  well  or  how  often  the  items  have  been 
used  in  the  past,  there  is  no  guarantee  that  they  operated  as  planned  with  a 
particular  person  on  a  specific  occasion. 

It  is  preferable  to  use  the  Rasch  model  for  this  purpose  precisely  because 
of  its  simplicity  and  because  of  the  logical  relationship  between  its  demands 
and  its  benefits.  If  the  demands  are  met,  the  benefits  necessarily  follow. 

Since  it  is  simple,  it  will  not  appear  to  explain  data  that  were  generated  by  a 
multidimensional  process.  When  it  fits,  we  know  exactly  what  we  have.  When  it 
does not fit, at least we know what has been taken out; therefore, all the
information about misfit should be left in the residuals from the model.

In  order  to  use  the  model  to  understand  what  could  have  gone  wrong  when  the 
data do not fit, some consideration must be given to how various forms of
misbehavior might appear in the data. It must be remembered that when the data do
not  fit,  the  same  logical  position  does  not  exist.  Although  it  can  be  predicted 
how  the  data  will  look  if  certain  things  happen,  it  does  not  follow  that  if  the 
data  look  that  way,  these  things  have  necessarily  happened. 

Three  much  discussed  disturbances  will  be  considered  in  this  paper:  random 
guessing,  speededness,  and  bias.  These  are  relatively  general,  since  many  other 
problems  can  be  stated  analogously  and  all  can  be  handled  by  a  single  strategy. 
In  addition,  it  will  be  suggested  how  the  fitted  ICCs  might  be  affected  if  a 
substantial  proportion  of  the  sample  were  engaging  in  these  activities. 

Random  Guessing 

This can only occur (1) when the person has no knowledge that would help
him/her choose a response or (2) when he/she does not read the question before
responding. Not everyone will guess randomly, and among those who will, the
propensity to do so is not necessarily the same. One reasonable description of
this process is as follows: If the person has a reasonable amount of knowledge
about the item, he/she will respond according to the model (solid ogive in
Figure 1a). At some level of difficulty, the person will decide that he/she knows
nothing about the item and will respond randomly, with the probability of
success (dashed line) being determined by the number of alternatives. The point
(Gy on the figure) at which this change-over occurs surely varies from person to
person according to confidence in ability and tendency toward risk-taking.

The residuals, after removing the effects of the item difficulties and the
person's ability, are as shown in Figure 1b. The positive residuals for the
difficult items imply a surprising degree of success on these items. Surprise
increases in moving toward more and more difficult items, and the rate of
success remains the same. The negative residuals correspond to responses that
should fit the model, but the expectation was upset by the person's unwarranted
successes on the difficult items. Dropping the difficult items from
consideration and recomputing the person's ability should result in an acceptable
measurement.
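
The pattern of residuals described here can be sketched for a single response string. The example assumes the dichotomous Rasch model and uses standardized residuals, (x - p) / sqrt(p(1 - p)), as one common way to express "surprise"; the paper does not specify its residual in this section. The ability value is simply fixed at 0 rather than estimated, and the difficulties and responses are invented.

```python
import math


def rasch_probability(theta, b):
    """Probability of success for ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def standardized_residuals(responses, theta, difficulties):
    """Residual of each response from its expectation, in standard units."""
    residuals = []
    for x, b in zip(responses, difficulties):
        p = rasch_probability(theta, b)
        residuals.append((x - p) / math.sqrt(p * (1.0 - p)))
    return residuals


# Hypothetical person: items ordered from easy to hard, with two
# unexpected successes on the hardest items (as under random guessing).
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
responses = [1, 1, 0, 0, 1, 1]
theta = 0.0  # assumed ability, not an estimate

z = standardized_residuals(responses, theta, difficulties)
for b, x, value in zip(difficulties, responses, z):
    print(b, x, round(value, 2))
```

The successes on the two hardest items produce large positive residuals that grow with item difficulty, which is the signature described in the text.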

If  a  substantial  proportion  of  the  sample  were  behaving  in  this  way,  it 
would  be  expected  that  ICCs  would  be  as  shown  in  Figure  2.  Estimates  of  item 
difficulty would be biased downward (that is, the item would be considered
easier than it actually is). Heterogeneity in the item discriminations would also
be observed, with the more difficult items appearing to have the lower
discriminations. The extent of the disturbance in both would depend on the propensity
of  the  sample  to  guess  and  on  the  proportion  of  the  sample  in  a  position  where 
guessing  was  a  viable  strategy.  The  methods  devised  by  Waller  (1974,  1976)  to 
correct  for  guessing  should  effectively  eliminate  both  problems. 

Speededness 


It  frequently  happens  with  group-administered  tests  that  not  everyone  has 
ample  time  to  completely  answer  every  item.  Although  time  limits  are  normally 
chosen to minimize this, they invariably involve some compromise between
administrative convenience and handicapping a few persons.

How  a  time  limit  affects  a  person  undoubtedly  depends  on  the  individual. 

The  person  might  simply  rush  through  the  test  without  spending  enough  time  on 
any  item.  The  residual  response  string  would  have  the  appearance  of  both  random 
guessing  and  carelessness,  with  some  difficult  items  answered  correctly  and  some 
easy  items  answered  incorrectly.  The  effect  on  the  estimate  of  ability  would  be 
to underestimate it. Except for a general fuzziness in the quality of the
measurement, this situation would be difficult to detect and diagnose
psychometrically.

A slow methodical person might carefully consider each item before
responding and consequently could leave several unanswered at the end. Detecting this
does not require high-powered psychometrics. Skillful test-takers might
complicate this picture by filling in as many answers as possible between the time the
proctor says stop and when the answer sheet is taken away. The response string
should be acceptable up to the point at which the behavior changes. From there
on it should have the appearance of random guessing.


Figure 2

Typical Item Characteristic Curve with Guessing

Deciding  what  to  do  about  either  of  these  cases  requires  some  consideration 
of  the  nature  of  the  variable  being  sought.  It  is  sometimes  argued  that  the 
capability  of  performing  the  tasks  rapidly  is  an  important  component;  so  all 
items  presented  to  the  person  should  be  considered  in  the  measurement.  However, 
the  ability  to  do  something  and  the  ability  to  do  it  quickly  are  not  necessarily 
the same. They may both be important and are often highly correlated; but if
they are different variables, they cannot be combined into a single, valid
measure and treated with a unidimensional model.

The  effect  of  ignoring  the  speededness  on  the  estimate  of  item  statistics 
would  be  to  make  the  items  at  the  end  of  the  test  appear  more  difficult  and  more 
discriminating  than  they  actually  are.  They  are  too  difficult  because  too  few 
people  responded  to  them  correctly,  and  the  discriminations  are  too  high  because 
the  persons  who  did  respond  correctly  will  tend  to  have  more  correct  answers  on 
the  whole  test  (because  they  actually  took  a  longer  test). 

Bias 


An  item  is  biased  against  a  person  if,  for  some  reason,  the  person  is  at  a 
disadvantage  on  that  item  relative  to  other  items  and  other  persons.  This  must 
involve  an  additional  latent  variable,  and  the  effect  of  the  person's  position 
on  this  variable  is  to  lower  his/her  chances  of  success  on  the  affected  items. 
One familiar example is a vocabulary test in which the word "sonata" is found to be biased, since not all cultural groups would have had equal exposure to it.

Most  discussions  of  item  bias  (e.g.,  Draba,  1978)  in  connection  with  the 
Rasch  model  have  suggested  techniques  such  as  the  following: 

1.  Calibrate  all  items  on  each  group  separately; 

2.  Compare  the  resulting  item  difficulties; 

3.  Items  which  display  significant  shifts  in  difficulty  between  groups  are 
considered  potentially  biased;  and 

4.  If  content  experts  concur,  the  items  are  revised  or  dropped. 
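
The statistical part of these steps (1 through 3) can be sketched as follows. This is only an illustration under stated assumptions: a centered log-odds of the proportion incorrect is used as a crude stand-in for a full Rasch calibration, the standard errors are the usual large-sample approximation for a logit, and the function names and the cutoff `z_crit` are hypothetical choices rather than anything prescribed by the techniques cited above.

```python
import math

def centered_logit_difficulties(responses):
    """Crude item difficulties for one group (an assumed stand-in for calibration).

    responses: list of persons, each a list of 0/1 item scores.
    Returns (difficulties centered at 0 within the group, approximate SEs).
    """
    n = len(responses)
    n_items = len(responses[0])
    logits, ses = [], []
    for i in range(n_items):
        right = sum(person[i] for person in responses)
        if right == 0 or right == n:
            raise ValueError("item %d has a zero or perfect score" % i)
        p = right / n
        logits.append(math.log((1.0 - p) / p))        # harder item -> larger value
        ses.append(1.0 / math.sqrt(n * p * (1.0 - p)))
    mean = sum(logits) / n_items
    return [v - mean for v in logits], ses

def flag_shifted_items(group_a, group_b, z_crit=2.0):
    """Steps 2 and 3: compare difficulties and flag large standardized shifts."""
    da, sa = centered_logit_difficulties(group_a)
    db, sb = centered_logit_difficulties(group_b)
    flagged = []
    for i, (a, b, ea, eb) in enumerate(zip(da, db, sa, sb)):
        z = (a - b) / math.sqrt(ea ** 2 + eb ** 2)
        if abs(z) > z_crit:
            flagged.append((i, a - b, z))
    return flagged
```

Because each group's difficulties are centered separately, a shift in one item is partly spread over the others, which is one concrete form of the dependence on an internal definition of the variable. Flagged items would still go to content experts (step 4).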

Although this approach is very useful, it is not the complete answer. It has two obvious shortcomings. First, it depends on an internal definition of the true variable. It can only work well if most items are "fair." If the bias is present uniformly in all items, the unfavored group will simply appear lower in ability. Second, this analysis is, again, population-based. It requires the definition of groups that must be arbitrary to some extent and then assumes that any bias in the items is the same for all members of the group. It seems preferable to treat every person as his/her own group and to require each item to demonstrate its validity for every person.

This can be accomplished by rearranging the steps and bringing the content experts in first. Their function would be to cluster the items according to some criterion so that the subsets would be homogeneous with respect to any suspected extraneous variables. This might be on the basis of vocabulary, as suggested earlier; or in the case of reading comprehension, it might be on the basis of the subject matter of the passages.

To take another example, if a mathematics reasoning test were composed of word problems, the items might be grouped by their readability. In this case, the person obviously has two abilities and the item two difficulties. The person can be successfully measured on one of the variables only if his/her performance is not affected by the other. If the person is sufficiently skillful at reading, he/she will read and understand all problems, regardless of the problem's resistance to being read. Therefore, the person's performance will be determined by his/her reasoning ability and the difficulty of the problem. It would not matter whether or not the items were equally difficult to read.

On  the  other  hand,  if  the  person  were  a  very  skillful  problem  solver  and  a 
poor  reader,  it  would  be  a  test  of  his/her  reading  ability.  His/her  performance 
would  depend  on  whether  or  not  he/she  could  read  and  understand  the  problem.  If 
the  calibrating  sample  were  like  this,  the  difficulties  assigned  to  the  items 
would  be  due  to  their  locations  on  the  reading  variable. 

The discriminations observed for the items could be influenced in either direction. One interesting situation would arise if the calibration sample were composed of two groups, equally able in problem solving but substantially different in reading ability. Assuming that the first group had no difficulty reading any of the items but that the second had trouble with several, those items will have apparent discriminations. Since the groups are the same in reasoning, they will only be separated by the difficult reading items; and for purposes of this paper the high discriminations would be entirely spurious.

Multidimensionality


Clearly,  the  problem  here  is  that  the  dimensionality  of  the  latent  space  is 
greater  than  the  dimensionality  of  the  model.  In  practical  applications  it  is 
more  convenient  to  try  to  control  the  situation  rather  than  to  generalize  the 
model.  This  conception  of  bias  can,  however,  be  generalized  to  include  any 
strategy  for  forming  subsets  of  the  items.  For  example,  difficult  items  could 
be considered biased against nonguessers; easy items, against careless test
takers.  Items  at  the  end  of  a  test  are  biased  against  slow  workers;  those  at 
the  beginning,  against  slow  starters.  Items  that  have  never  been  used  before 
are  biased  against  examinees  who  belong  to  fraternities  with  good  test  files. 
There  never  seems  to  be  any  problem  for  people  who  are  interested  in  a  test  to 
generate  hypotheses  about  possible  dimensions  in  it. 

At this point a simple strategy can be described for checking each person's response string for multidimensionality. This person analysis is not a replacement for a thorough item analysis but, rather, is an addition to it. The hypotheses here about Person x Item Subset interactions are distinct from the hypotheses in item analysis about Item x Person Group interactions. The power of the person analysis comes from replicating items that are alike in some sense; in item analysis it depends on replicating similar people.

The principle employed in the analysis is the objectivity of the measure. If it is truly objective, then all subsets of items should yield statistically equivalent estimates of the person's ability. If not, it would be concluded that the presence of multidimensionality is related to the manner in which the subsets were defined.

The most obvious method of doing the arithmetic would be to actually compute the ability associated with the person's score on each subtest and to perform an analysis of variance to test for between-subtest differences. This has the immediate drawback of being unable to deal with 0 or perfect scores. Gustafsson (1979) has recently proposed a set of procedures based on conditional maximum likelihood estimation, which has many desirable statistical properties. This, however, can be expensive, and it is still somewhat restrictive in the maximum number of items it can handle.


A convenient and economical analysis can be developed from the unconditional estimation equations. The basic equations needed are shown in Table 1. The notation used follows the conventions of Rasch where possible:

$v$ designates the person,
$i$ designates the item,
$x_{vi}$ is the score obtained by person $v$ on item $i$ (equals 0 if incorrect, 1 if correct),
$b_v$ is the ability of person $v$ estimated from all items, and
$d_i$ is the estimated difficulty for item $i$.

Table 1
Unconditional Person Analysis

Scaled Residual

$$y_{vi} = (x_{vi} - p_{vi}) / w_{vi},$$

where

$$p_{vi} = \exp(b_v - d_i) / [1.0 + \exp(b_v - d_i)]$$

and

$$w_{vi} = p_{vi}(1 - p_{vi}).$$

Misfit Due to Subtest j

$$v_{vj} = \sum_{i \in j} y_{vi}^2 w_{vi}^2 \Big/ \sum_{i \in j} w_{vi}$$

Effect Due to Subtest j

$$y_{vj} = \sum_{i \in j} y_{vi} w_{vi} \Big/ \sum_{i \in j} w_{vi}$$

Between-Subtest Mean Square

$$v_{vB} = \sum_{j} y_{vj}^2 \sum_{i \in j} w_{vi} \Big/ \sum_{i} w_{vi}$$


The scaled residual is simply the difference between the person's observed item score, $x_{vi}$, and his/her predicted item score, $p_{vi}$, predicted from his/her performance on the total test. The difference has been rescaled by multiplying by $1.0 / [p_{vi}(1 - p_{vi})]$, which is the derivative of $b$ with respect to $p$, so that the residual is expressed in logits to a first-order approximation.

The misfit statistic has the form, but not the distribution, of a sum of squared z-statistics; that is, it is the sum of $(X - P)^2$, divided by the sum of $PQ$. It will be large when the ability does not adequately explain the person's part in every $x_{vi}$.

The effect due to subtest $j$ is simply the first adjustment that would be made if estimating the person's ability for subtest $j$ were attempted using the total test ability as the starting value. This form is the Newton-Raphson solution to the unconditional maximum likelihood estimation equations. Since it is not iterated, there is no problem with zero or perfect scores. The between-subtest mean square asks if all these effects are null.
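
The person analysis just described can be sketched in a few lines. This is a minimal illustration, not the program used for the analyses reported here. The function names and the dictionary used to define subtests are assumptions, and the misfit is computed as the sum of squared raw residuals over the sum of p(1 - p), following the verbal description of that statistic.

```python
import math

def rasch_p(b, d):
    """Rasch probability of a correct response for ability b, difficulty d."""
    return 1.0 / (1.0 + math.exp(-(b - d)))

def person_analysis(x, d, b, subsets):
    """Unconditional person analysis for one response string.

    x: list of 0/1 item scores; d: item difficulties (logits);
    b: ability estimated from the total test;
    subsets: dict mapping a subtest label to a list of item indices.
    Returns (effect per subtest, misfit per subtest, between-subtest mean square).
    """
    p = [rasch_p(b, di) for di in d]
    w = [pi * (1.0 - pi) for pi in p]                    # item information
    y = [(xi - pi) / wi for xi, pi, wi in zip(x, p, w)]  # scaled residuals, in logits
    total_w = sum(w)
    effect, misfit = {}, {}
    between = 0.0
    for label, items in subsets.items():
        wj = sum(w[i] for i in items)
        # One Newton-Raphson step from the total-test ability (not iterated).
        effect[label] = sum(y[i] * w[i] for i in items) / wj
        # Sum of squared raw residuals (y * w = x - p) over the sum of p(1 - p).
        misfit[label] = sum((y[i] * w[i]) ** 2 for i in items) / wj
        between += effect[label] ** 2 * wj
    return effect, misfit, between / total_w
```

For example, four items of difficulty 0, a total-test ability of 0, and the response string 1, 1, 0, 0 split into first and second halves give effects of +2 and -2 logits, which is the kind of pattern the between-subtest statistic is meant to expose.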

There are some problems with these statistics. In particular, neither the form of their null distribution nor the appropriate degrees of freedom is known. Wright (1979) and Haberman (1979) have been investigating various weighting and standardizing schemes with the hope of bringing the distribution into line with a standard distribution. It seems important from their work that the numerators and denominators be summed separately, as they are in Table 1.

The  degrees  of  freedom  are  a  problem,  since  the  usual  analysis  of  variance 
type  counting  assumes  every  observation  contains  the  same  amount  of  information, 
which  clearly  does  not  apply  here.  One  promising  candidate  seems  to  be  to  base 
a  calculation  of  pseudo-degrees  of  freedom  on  the  information  function.  In  some 
cases  this  may  be  as  simple  as  4.0  x  pq. 

Additional work is needed in this area. Two obvious and useful activities are the careful simulation of known situations over a broad range of conditions and a comparison of these statistics with those produced by the conditional approach of Gustafsson.

In practice, however, these studies are of secondary interest. Since it is well known that data do not fit the Rasch model anyway, it is of marginal utility to continue demonstrating this. What is useful is a general index of the quality of measurement for each person, that is, an indication of how closely the data reflect objectivity. A weighted fit mean square based on the scaled residual in Table 1 seems to accomplish this. Following that, some specific statistics are needed to help diagnose the problems when they occur. The unconditional between-set analysis shown in Table 1 has proved useful for this in a variety of applications. The additional research work needed includes the application of statistics like these to real data and the interpretation of the results to knowledgeable people to see if they make sense.

Examples  of  the  Person  Analysis 


Certification  Examination  Data 


The first example, shown in Table 2, is taken from an actual administration of a professional certification examination. The examination included about 1,000 items, some of which were omitted in scoring. The items were arranged in six booklets, intended to be as parallel as possible, and administered over two days. Six test committees, operating independently and representing six different content areas, actually wrote the items. Subject to printing considerations, the six areas were distributed evenly over the six booklets. The certification decision was based on the total raw score.

The only justification for considering this test to be unidimensional is empirical. After analyzing literally tens of thousands of examinees, fewer than 1% of the records appear to be seriously flawed. This may be attributed to the homogeneity of the training program.

Table 2
Among and Within Item Subset Analyses for Example 1

NAME                  COUNT  SCORE  ABILITY   S.E.  WTHIN  BETWN

ADMINISTRATIVE BOOK                    0.15   0.02    5.8   14.0
  BK 1                  144     29    -1.54   0.23    5.8
  BK 2                  157     96     0.42   0.18    0.7
  BK 3                  149     81     0.45   0.18    0.3
  BK 4                  139     82     0.34   0.19    1.1
  BK 5                  149     90     0.59   0.19    1.6
  BK 6                  130     74     0.49   0.20    0.3

                                       0.15   0.08    5.8   -1.3
  AAAA                  150     70     0.09   0.18    1.0
  BBBB                  149     79     0.35   0.18    2.0
  CCCC                  146     83     0.00   0.19    1.8
  DDDD                  139     77     0.06   0.19    2.1
  EEEE                  144     73     0.25   0.19    1.6
  FFFF                  140     70     0.10   0.19    5.7

ITEM TYPE                              0.15   0.00    5.8    2.2
  A                     239    129     0.18   1.14    4.9
  B                     229    128     0.24   0.15    2.9
  C                     140     67     0.01   0.19    1.3
  K                     174     80     0.06   0.17    2.3
  N                      30      8    -1.04   0.45   -0.2
  G                      56     40     0.88   0.32   -0.2

NEW OR USED ITEMS                      0.15   0.08    5.8   -0.1
  NEW                   481    238     0.08   0.10    5.9
  USED                  387    214     0.22   0.11    2.1

DIFFICULTY                             0.15   0.08    5.8    4.7
  -2.0                   48     39    -1.14   0.39    9.4
  -1.0                  106     80    -0.27   0.22    2.0
   0.0                  237    149     0.08   0.14    0.4
   1.0                  314    136     0.19   0.11    0.7
   2.0                  163     48     0.59   0.17    5.1

Note. The original display also plotted each subset ability (*) with its standard error band (-) against the total-test ability (+).

The method that was used to decide which tests were flawed was both statistical and substantive. All interesting fit statistics were computed, their distributions examined, and the suspicious cases displayed for discussion by a task force selected for that purpose. Beginning with the largest misfit and working down, the task force examined each case in order until it was satisfied that the statistics were best explained as minor random fluctuations. In general, this occurred at about three standard deviations above the mean. The statistics were presented as pseudo-t statistics rather than mean squares, to keep the standards as consistent as possible. In all cases the means were near 0 and the standard deviations about 2.

Table 2 is a portion of the display for the person with the largest between-subtest statistic (14.0) of any in this administration. Each panel in this display represents a different criterion for defining subtests (e.g., test booklet, item type). In all cases the column labeled "ABILITY" presents, first, the estimate of the person's ability based on the total test (.15), followed by the ability based on each subset. The fit statistics are in the columns "WTHIN" (within) and "BETWN" (between). The total fit statistic is the first number in the WTHIN column and the between-subtest statistic is in the BETWN column. The remaining numbers in the WTHIN column are the fit statistics within each subset. They are analogous to the total fit statistic, computed as if only that subtest were being considered.

The explanation for the large between-booklet statistic is simple. A separate answer sheet was used for each booklet. The first one for this person was torn slightly, causing the scanner to misread the form identification, so the result was an essentially random score for that booklet. Rescoring this sheet correctly eliminated all disturbances in this record.


Table 3
Fit Statistics for Example 2

Ability               -1.09
SE                     0.37
Total Fit              0.0
Sequence
  Between Subsets      2.4
  Within Subsets       1.1
Difficulty
  Between Subsets     -5.9
  Within Subsets       0.2


Mathematics  Placement  Test  Data 

For those who cannot depend on having a 1,000-item test, Table 3 and Figure 3 give a different type of example. This was from a mathematics placement test for beginning college freshmen. It consisted of 40 items in two separately timed segments. The reward for doing well was placement in a more difficult mathematics course.

The fit statistics were again pseudo-t's, but the standardization was done differently so that although the observed mean was again near 0, the standard deviation was about .6. The item subsets were defined by sequence and by difficulty, with four subsets in each. The between-subset fit statistics for sequence and for difficulty were both large enough to be interesting. The plot by sequence explains the problem: The person took only the first half of each segment. These statistics do not indicate if the problem was due to lack of ability, lack of interest, or the inability to read English; and the decision regarding appropriate action on the part of the test user requires answers to these questions.

The difficulty fit statistic was large negatively because the person appeared too "sensitive" to item difficulty. The items were arranged in roughly increasing order of difficulty in each segment. Therefore, by only attempting half the test, the person tended to answer the easier items correctly and to answer all the more difficult ones incorrectly.

Conclusions

There is, in practice, an important distinction between what might be called statistical dimensions and conceptual dimensions. Statistical dimensions are any that can be found in the data; conceptual dimensions are everything that can be thought of. Successful measurement requires the former, but safety requires constantly checking to be sure the latter exists only in our heads.

For example, in a homogeneous situation it might be possible to mix together verbal items and quantitative items. If the same results are obtained regardless of the proportion of each, the result could be called "verbal ability," "quantitative ability," or "general intelligence." With every test administration, however, it would be necessary to prove that it still does not matter, that is, that whatever causes these things that seem different to look the same is still operating.

Consideration of how this can be done leads to the observation that statistical unidimensionality is necessary and sufficient for fit to the Rasch model. The necessity need not be argued here. With respect to sufficiency, however, if the data are unidimensional, a model of the Rasch family will fit them.

To support this rather remarkable contention, two arguments can be made. The first is that the Rasch model is the only latent trait model for which it is possible to uniquely and unambiguously rank-order all the objects and agents along a single continuum. With this ordering, unidimensionality is obvious. Without it, phrases such as "more" or "less of the variable" do not make much sense, since there will be situations in which people do better on a more difficult item than on an easier one. The ability to rank individuals and items when a single variable is involved and the inability to generalize this concept to more than one dimension seem a critical point in this discussion.

The second argument concerns item discrimination. The first one may have been objected to on the grounds that variation in item discrimination does not make the data multidimensional but does make them fail to fit the Rasch model. Throughout this paper the point has been made that many extraneous variables will cause apparent heterogeneity; but that is the reason for hesitation in using item discriminations, not why they should not be used.

Items with extreme observed discriminations can always be explained in terms of additional variables. Usually, such items are obviously flawed: Some irrelevant and perhaps nonreproducible aspect of the item has interacted with special characteristics of the sample. Occasionally, these items provide a useful and constructive insight into the variable. Generating new items to take advantage of this new knowledge may very well lead to a refined (i.e., changed) definition of the variable that serves our purposes better than the old. An important point here is the ability to abstract whatever it is that distinguishes this item and to use it in the new items. It would then be expected that the new items would have discriminations that are similar to one another.

This  paper  has  attempted  to  make  two  statements  about  dimensionality  and 
the  Rasch  model.  First,  an  explanation  has  been  suggested  of  why  such  a  simple 
model  has  worked  as  well  as  it  has  in  many  complex  situations  employing  such 
artificial  agents  as  multiple-choice  items.  Second,  the  unique  relationship 
between  the  model  and  unidimensionality  has  been  described.  Whether  or  not  this 
represents  a  fundamental  and  universal  truth,  it  would  be  extremely  productive 
to  adopt  this  as  the  definition  of  unidimensionality.  It  would  probably  lead  to 
better  tests  than  exist  now. 


In adaptive testing the theory and simulations suggest that adequate measurements can be obtained with a handful of items. Attempts to do this in practice have not worked very well. This suggests that the technical knowledge about how to do this is growing faster than the substantive wisdom about why it is desirable to do it. The methodology is now available to converge very rapidly on the person's location on a unidimensional trait; however, there do not seem to be very many unidimensional traits.

If it is intended that measuring people and making decisions about them based on those measurements continue, then a serious analysis of the quality of the measure for every person should be routine. Rather than being satisfied if the person is given a very short test, the power of adaptive testing would be better employed to explore the dimensionality of the space for the person. Is the first result reproducible over a useful range of the variable and for another selection of items? For most persons, the results will simply reassure the psychometrician that everything is in order; for other persons, something interesting may be learned about them and about the variable being pursued.


Robust Estimation in the Rasch Model


Howard Wainer

Latent trait models as a class, and the Rasch model in particular, have begun to have substantial impact on the construction and scoring of mental tests. Through the use of latent trait models, measures of individual ability as well as item difficulty that have important practical and statistical properties can be obtained. For example, if the Rasch model fits, the measures of ability and difficulty obtained are interval-scaled, thus making the quantitative study of change possible. The Rasch model characterization of a person's performance on an item as a function of the difference between that person's ability and the difficulty of the item yields the useful result that sample-free item calibration, as well as test-free person measurement, can be obtained. There are many more reasons why a latent trait formulation is an important one (see, e.g., Bock & Wood, 1971; Hambleton, Swaminathan, Cook, Eignor, & Gifford, 1978; Lord & Novick, 1968; Rasch, 1960; Wainer, Morgan, & Gustafsson, 1979; Wright, 1968, 1977; Wright & Panchapakesan, 1969).

The problem in harvesting the benefits of latent trait models is the problem of fit, since these benefits follow only when the model fits. Studies of robustness (Lord & Novick, 1968, p. 492) indicate that certain parameters are robust with respect to modest deviations from the underlying assumptions; in particular, it seems that the Rasch model yields rather good estimates of ability and difficulty even when its assumption of equal slopes is only roughly approximated. The models that parameterize differential slopes have difficulty recovering the slope parameters even when the data do fit their model. Although this is not a topic of the present paper, it is desirable to indicate that attempts to expand the 1-parameter model to encompass additional possible characteristics of the data through an increase in the number of item parameters do not appear to be completely successful yet. Slope parameters are not well estimated in testing situations with only a few hundred individuals (Lord, 1979); and lower asymptotes, introduced to deal with guessing, cannot be consistently estimated (Ree & Jensen, 1979).

The  Problem 


If the Rasch model fits a given set of data, it has many practical benefits. It can never fit exactly, however, because there are always disturbances. These disturbances often take the form of (1) guessing, when a person of low ability gets a difficult item correct, and (2) sleeping, when a person of high ability gets an easy item wrong (Wright & Mead, 1977). The model has a certain amount of robustness with respect to such aberrations, but they can make the estimation procedures both biased and inefficient. The problem, then, is how to estimate the parameters of interest accurately and efficiently even when the data do not fit the model.

Some  Choices 


As a means of dealing with this problem, five different estimation schemes will be considered. These alternatives will be compared over a variety of simulations. It will be assumed that item difficulties are available and that only person abilities are to be estimated. This is a reasonable assumption because the calibration sample can be greatly increased. Individuals who have unusual patterns of response can be winnowed from it and a subset of individuals who are not "noisy" can be obtained. These individuals can then be used to obtain good estimates of item difficulty. However, the reverse is not true: A test of great length cannot be given, and when reporting on real persons, individuals who do not behave exactly as the model dictates cannot be eliminated. Abilities should be estimated for everyone. The task is to explore various estimation methodologies that assume the availability of item difficulties and to try to estimate ability as accurately and efficiently as possible. It may be that some of the techniques described will be of some use in the estimation of item difficulties as well, but this is not the primary motivation.

The  Rasch  Model 


The Rasch model is based on the equation

$$P_{ij} = \exp(a_i - d_j) / [1 + \exp(a_i - d_j)], \qquad [1]$$

where

$P_{ij}$ is the probability of person $i$ answering item $j$ correctly;
$a_i$ is the ability of person $i$ ($i = 1, \ldots, N$); and
$d_j$ is the difficulty of item $j$ ($j = 1, \ldots, L$).
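
As a small illustration, Equation 1 can be evaluated directly. The function name and the branch used to avoid overflow for extreme logit differences are assumptions of this sketch, not part of the model statement.

```python
import math

def rasch_probability(ability, difficulty):
    """P(correct) under the Rasch model for one person and one item (logit units)."""
    t = ability - difficulty
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))   # algebraically equal to exp(t) / (1 + exp(t))
    e = math.exp(t)                         # t < 0, so exp(t) cannot overflow
    return e / (1.0 + e)
```

A person whose ability equals the item difficulty has a probability of .5 of answering correctly; only the difference between the two parameters matters.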

Scheme 1: Pure Rasch


This is the standard maximum likelihood method for estimating Rasch abilities, given a vector of item difficulties. It relies on the Rasch model property that the raw score is a sufficient statistic for estimating ability. Each raw score has a distinct ability level associated with it. To find what it is, Equation 1 is solved for $a_i$, usually through Newton-Raphson:

$$r_i - \sum_j P_{ij} = 0 \qquad [2]$$

or

$$r_i - \sum_j [\exp(a_i - d_j) / (1 + \exp(a_i - d_j))] = 0, \qquad [3]$$

where $r_i$ is the raw score for person $i$.

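
A minimal sketch of this Newton-Raphson solution follows. The function name, the starting value, the convergence tolerance, and the decision to raise an error for zero and perfect scores (which have no finite maximum likelihood estimate) are assumptions made for illustration.

```python
import math

def ability_from_raw_score(r, difficulties, tol=1e-8, max_iter=100):
    """Solve r - sum_j P_j(a) = 0 for the ability a by Newton-Raphson."""
    L = len(difficulties)
    if not 0 < r < L:
        raise ValueError("no finite estimate for a zero or perfect raw score")
    # Starting value: log-odds of the raw score, shifted by the mean difficulty.
    a = math.log(r / (L - r)) + sum(difficulties) / L
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(a - d))) for d in difficulties]
        residual = r - sum(p)                    # left side of the estimation equation
        info = sum(pi * (1.0 - pi) for pi in p)  # negative derivative of the residual
        step = residual / info
        a += step
        if abs(step) < tol:
            break
    return a
```

Because the raw score is sufficient, the function needs only `r` and the item difficulties, not the response pattern itself.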
Scheme 2: Traditional Correction for Guessing

The traditional guessing correction is the assumption that if a person does not know the answer to a question and guesses, then the probability of guessing correctly is 1/M, where M is the number of choices. Thus, if there is an M-choice test and an individual has C wrong, it is assumed that he/she has an additional C/(M - 1) correct as a result of guessing. This is a crude attempt to put a lower asymptote on the item characteristic curve.
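
The arithmetic of the correction can be illustrated briefly. The function name and the example numbers are hypothetical; the sketch only removes the C/(M - 1) answers presumed to be lucky guesses and does not address how the corrected score is then converted to an ability.

```python
def corrected_score(num_right, num_wrong, num_choices):
    """Raw score minus the answers presumed correct by guessing, C / (M - 1)."""
    if num_choices < 2:
        raise ValueError("an item needs at least two choices")
    return num_right - num_wrong / (num_choices - 1)
```

For instance, 30 right and 8 wrong on a five-choice test gives 30 - 8/4 = 28.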

Scheme  3:  Standard  Jackknife 


The Jackknife is an estimation scheme that was developed to reduce bias and has been shown (Tukey, 1958) to be useful for hypothesis testing as well. The way that it works in the application in this study is to construct a matrix of abilities, A, which has L - 1 raw scores labeling the rows and L + 1 columns. The elements of the first column, A(r, 1), are the abilities associated with raw score r, calculated through the method described in Scheme 1. The second column includes the ability levels based upon a test with the first item omitted. This test has only L - 1 items. Each succeeding column represents ability estimated through Scheme 1 but with that item omitted. Thus the kth column is a test of length L - 1 containing all items except item k - 1.

The Jackknifed pseudovalues of ability are

    $$ a_j^* = L\,A(r,1) - (L-1)\,[\,x_j A(r-1,\,j+1) + (1-x_j)\,A(r,\,j+1)\,] \qquad [4] $$

where

    $x_j$ = 0 if item j is answered incorrectly
    $x_j$ = 1 if item j is answered correctly; and

the Jackknifed estimate of ability, $a^*$, is just the mean of these $a_j^*$'s:

    $$ a^* = \sum_j [a_j^*/L] = L\,A(r,1) - [(L-1)/L] \sum_j [\,x_j A(r-1,\,j+1) + (1-x_j)\,A(r,\,j+1)\,] \qquad [5] $$

for j = 1,...,L.

For reasons that will become clear when the results of the simulations are
discussed, it is important to note that the Jackknifed ability estimates are
easy to compute. For any test all that has to be done is to compute the matrix
A and then, for each person, to run across the matrix at that person's raw
score, adding up the entries in that row for each item that is incorrect and
jumping up one row for each item that is correct. Jumping occurs when an item
is correct, because the raw score for that person excluding that item is then
one less.

Next, there are two properties of an estimator that are of concern. First, it
should reduce bias, i.e., the effects of odd response patterns. The Jackknife was
developed as a method to reduce bias (Quenouille, 1956), so it is hoped that it
will serve this purpose. Secondly, it is desirable that the estimator not
fluctuate too much with minor disturbances in the response vector. This quality
has been termed "resistance" (Tukey, 1977) and corresponds to an estimator
having a sampling distribution with a small variance. The Jackknife is known to be
modestly "resistant," so this quality is likely to be met in practice as well.

To  see  how  estimation  with  the  Jackknife  works,  consider  a  test  with  10 
items  whose  difficulties  are  uniformly  distributed,  spanning  a  range  of  four 
logits.  These  difficulties  are  shown  below: 


-2.00  -1.56  -1.11  -0.67  -0.22 
0.22  0.67  1.11  1.56  2.00 

This  yields  the  raw  score-to-ability  transformation  matrix  A,  shown  in  Table  1. 

Table 1
The Raw Score to Ability Conversion Matrix

        Ability
Raw     Estimate                   Ability Estimate Omitting Item
Score   All Items      1      2      3      4      5      6      7      8      9     10

  1         -2.78  -2.32  -2.45  -2.56  -2.63  -2.68  -2.72  -2.74  -2.75  -2.76  -2.77
  2         -1.83  -1.37  -1.45  -1.54  -1.63  -1.69  -1.74  -1.77  -1.79  -1.80  -1.81
  3         -1.15  -0.68  -0.73  -0.80  -0.88  -0.95  -1.01  -1.05  -1.09  -1.11  -1.12
  4         -0.56  -0.07  -0.10  -0.15  -0.21  -0.29  -0.35  -0.41  -0.46  -0.49  -0.51
  5          0.00   0.51   0.49   0.46   0.41   0.35   0.29   0.21   0.15   0.10   0.07
  6          0.56   1.12   1.11   1.09   1.05   1.01   0.95   0.88   0.80   0.73   0.68
  7          1.15   1.81   1.80   1.79   1.77   1.74   1.69   1.63   1.54   1.45   1.37
  8          1.83   2.77   2.76   2.75   2.74   2.72   2.68   2.63   2.56   2.45   2.32
  9          2.78

Consider how ability for a response vector of (1111110001) would be estimated.
The raw score is 7, so the first 6 values associated with a raw score of
6 are summed (since the first 6 items were correct, omitting any one of them
leaves a raw score of 6). Next, the three values associated with a raw score of
7 (for Items 7, 8, and 9) are added on, since these items were incorrect; so
omitting them still yields a raw score of 7. Last, .68, the ability value
associated with a raw score of 6 with Item 10 omitted, is added on. Summing
these gives a total of 11.63. Next, this is multiplied by 9/10 [(L - 1)/L] and
subtracted from 11.50 [L x 1.15], yielding a Jackknifed estimate for this
person's ability of 1.03. Referring back to Table 1, it can be seen that a raw
score of 6 yields an ability estimate of .56, which would have been the result
if this person's answering the last item correctly had been treated as a wild
guess and changed to incorrect. On the other hand, if this response were fully
believed, his/her raw score would have been 7 and his/her ability estimate
1.15. The Jackknife weighs these two extremes and places the estimate between
them.

Next, suppose that the response vector was (1111110010). Then, it is found
that the value of .68 associated with getting Item 10 correct is replaced
with .73 (for Item 9) and 1.45 is replaced by 1.37. The net result of this
changes the Jackknifed estimate from 1.03 to 1.06. This is just what a sensible
person would do, since the second response pattern is more likely to have arisen
through "proper" test taking and indicates a somewhat higher ability.
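The two worked examples can be reproduced with a short program. The sketch below is an illustration under stated assumptions rather than the authors' code: the helper names (`rasch_ability`, `ability_matrix`, `jackknife_ability`) are hypothetical, the matrix is stored as a dictionary keyed by raw score, and the ten difficulties are taken to be evenly spaced from -2 to +2. Because the matrix is computed to full precision instead of being read from a two-decimal table, the results should agree with the quoted 1.03 and 1.06 only to within rounding.

```python
import math


def rasch_ability(raw_score, difficulties, tol=1e-10, max_iter=100):
    """Maximum likelihood ability for a raw score (Newton-Raphson)."""
    n = len(difficulties)
    a = sum(difficulties) / n + math.log(raw_score / (n - raw_score))
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(d - a)) for d in difficulties]
        step = (raw_score - sum(probs)) / sum(p * (1.0 - p) for p in probs)
        a += step
        if abs(step) < tol:
            break
    return a


def ability_matrix(difficulties):
    """A[r][0] uses all items; A[r][j + 1] omits item j (0-based)."""
    n_items = len(difficulties)
    matrix = {}
    for r in range(1, n_items):
        row = [rasch_ability(r, difficulties)]
        for j in range(n_items):
            rest = difficulties[:j] + difficulties[j + 1:]
            # A perfect score on the shortened test has no finite estimate.
            row.append(rasch_ability(r, rest) if r < n_items - 1 else None)
        matrix[r] = row
    return matrix


def jackknife_ability(responses, matrix):
    """Mean of the pseudovalues: L*A(r,1) - ((L-1)/L) * sum of row entries."""
    n_items = len(responses)
    r = sum(responses)
    if not 1 < r < n_items - 1:
        raise ValueError("raw score too extreme to jackknife")
    total = 0.0
    for j, x in enumerate(responses):
        # A correct item, once omitted, leaves raw score r - 1: jump up a row.
        total += matrix[r - 1][j + 1] if x == 1 else matrix[r][j + 1]
    return n_items * matrix[r][0] - (n_items - 1) / n_items * total


difficulties = [-2.0 + 4.0 * k / 9.0 for k in range(10)]
A = ability_matrix(difficulties)
first = jackknife_ability([1, 1, 1, 1, 1, 1, 0, 0, 0, 1], A)
second = jackknife_ability([1, 1, 1, 1, 1, 1, 0, 0, 1, 0], A)
print(round(first, 2), round(second, 2))
```

The matrix is built once per test; each person's estimate is then a single pass across two of its rows, which is what makes the scheme cheap to apply.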

It appears that the Jackknife does what is desired, although how well is
yet to be determined. It seems, however, from this demonstration that the
variance of the sampling distribution of the Jackknifed ability is apt to be small,
since large disturbances in response pattern do not cause large variations in
the ability estimates. To see this, note that the ability estimate associated
with the pattern (1111111001) is 1.09. (Other patterns can be attempted in
order to observe how this estimation scheme behaves.) The Jackknife is not
insensitive to response pattern, as Rasch estimates are, but it does not fluctuate
much. This will be demonstrated in the results section.

Scheme  4:  AMT-Robustified  Jackknife 


The  pseudovalues  obtained  from  standard  Jackknifing  suggest  an  additional 
estimation  methodology.  Consider  the  response  pattern  (1111110001)  again.  If 
the  pseudovalue  associated  with  each  item  is  calculated  using  Equation  2,  this 
gives 


Item  Pseudovalue 

1  1.42 

2  1.51 

3  1.69 

4  2.05 

5  2.41 

6  2.95 

7  -3.17 

8  -2.36 

9  -1.55 

10  5.38 

The mean of these pseudovalues yields the Jackknifed estimate of ability.
Now consider these pseudovalues and how they are combined in the Jackknife.

There are two kinds of pseudovalues: negative ones associated with incorrect
responses and positive ones associated with correct responses. The Jackknife
could be understood as first averaging the negative ones, thus coming out with
an average ability estimate based upon items missed; then averaging the
positive ones for an ability estimate from the items answered correctly; and
finally, combining these two averages, weighted by their sample sizes, to yield
the final Jackknifed estimate. It is known that the mean can be a poor way to
estimate location. In some situations (Andrews, Bickel, Hampel, Huber, Rogers, &
Tukey, 1972) it is the worst of all choices. Since concern is with unusual
situations, perhaps the performance of the Jackknife can be improved through the
choice of an estimator of location more robust than the mean.

Suppose the median of the positive pseudovalues is calculated. This is
2.05. The median of the negative pseudovalues is -2.36. Weighting these by 7
and 3, respectively, and summing and dividing by 10 yields an estimated ability
of .73. Whether or not this is better than the Jackknifed value of 1.03 is
difficult to determine, but it is certainly not too deviant.
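The arithmetic of this paragraph can be checked with the pseudovalues listed above. In the sketch below the function name `split_location_estimate` is hypothetical; passing the mean as the location estimator recovers the ordinary Jackknifed estimate, and passing the median gives the robustified value.

```python
from statistics import mean, median

# Pseudovalues for the response pattern (1111110001), Items 1 through 10.
pseudovalues = [1.42, 1.51, 1.69, 2.05, 2.41, 2.95, -3.17, -2.36, -1.55, 5.38]


def split_location_estimate(values, location):
    """Estimate location separately for the positive and the negative
    pseudovalues, then combine the two weighted by group size."""
    positive = [v for v in values if v > 0]
    negative = [v for v in values if v <= 0]
    total = 0.0
    for group in (positive, negative):
        if group:
            total += len(group) * location(group)
    return total / len(values)


print(round(split_location_estimate(pseudovalues, mean), 2))    # ordinary Jackknife
print(round(split_location_estimate(pseudovalues, median), 2))  # median version
```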

One of the winners of the Princeton Robustness Study (Andrews et al., 1972)
was the sine M estimator (the AMT). This estimator has an influence function
nearly like that of the mean for observations close to the center but going to
zero at the extremes. This implies that it will be efficient for nearly Gaussian
distributions and robust against fat tails and outliers.

To understand how the AMT is calculated, consider that in regular cases,
likelihood estimation of the location and scale parameters $\theta$ and $\sigma$
of a sample from a population with known shape leads to equations of the form

    $$ \sum_j \left[ -f'(z_j) / f(z_j) \right] = 0 , \qquad [6] $$

and

    $$ \sum_j \left[ -z_j f'(z_j) / f(z_j) - 1 \right] = 0 , \qquad [7] $$

where $f$ is the density function and $z_j = (x_j - \theta)/\sigma$.

M estimates of location are solutions, T, of an equation of the form

    $$ \sum_j \psi[(x_j - T)/s] = 0 \qquad [8] $$

where $\psi$ is an odd function and s is estimated either independently or
simultaneously.

The sine M estimate (AMT) is an M estimate in which the function $\psi$ is

    $$ \psi(x) = \begin{cases} \sin(x/2.1) & |x| < 2.1\pi \\ 0 & \text{otherwise.} \end{cases} \qquad [9] $$

The  fourth  scheme,  then,  is  to  use  the  AMT  estimator  on  the  positive  and 
negative  pseudovalues  separately,  obtaining  two  estimates  of  ability.  These  two 
estimates  are  then  weighted  by  the  number  of  observations  that  went  into  them 
and  summed.  The  resulting  value  is  then  divided  by  the  total  number  of  items 
and  the  result  is  the  AMT  Jackknife  estimate. 
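One way the fourth scheme might be computed is sketched below. Several choices here are assumptions, not details given in the text: the scale s is fixed at the median absolute deviation about the median, the estimating equation is solved by iteratively reweighted averaging started at the median, and the sine function is taken to have the constant 2.1 with rejection beyond 2.1 times pi. The function names are hypothetical.

```python
import math
from statistics import median

C = 2.1  # tuning constant of the sine function (assumed reading of the text)


def amt_location(values, c=C, tol=1e-8, max_iter=100):
    """Sine M estimate of location with a fixed MAD scale (an assumption)."""
    t = median(values)
    s = median([abs(v - t) for v in values])
    if s == 0:
        return t
    for _ in range(max_iter):
        num = den = 0.0
        for v in values:
            z = (v - t) / s
            if abs(z) >= c * math.pi:
                continue  # psi is zero here, so the observation is ignored
            u = z / c
            w = math.sin(u) / u if u != 0 else 1.0  # weight proportional to psi(z)/z
            num += w * v
            den += w
        if den == 0:
            return t
        new_t = num / den
        if abs(new_t - t) < tol:
            return new_t
        t = new_t
    return t


def amt_jackknife(pseudovalues):
    """Apply the AMT to positive and negative pseudovalues separately,
    then combine the two estimates weighted by group size."""
    positive = [v for v in pseudovalues if v > 0]
    negative = [v for v in pseudovalues if v <= 0]
    total = sum(len(g) * amt_location(g) for g in (positive, negative) if g)
    return total / len(pseudovalues)


pseudovalues = [1.42, 1.51, 1.69, 2.05, 2.41, 2.95, -3.17, -2.36, -1.55, 5.38]
print(round(amt_jackknife(pseudovalues), 2))
```

With this pattern the large pseudovalue for Item 10 (5.38) receives only a small weight, so the combined estimate falls below the ordinary Jackknifed value.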

It is expected that when the test response pattern is reasonable (i.e., no
responses are obtained that are unlikely, based upon the Rasch model), the
AMT-Jackknife will look like the standard Jackknife. But when there are some odd
responses, they will not be counted as heavily, and the AMT-Jackknife will thus
produce an estimate that is less affected by guessing and sleeping, while
retaining the standard Jackknife's narrow sampling distribution.

Scheme 5: WIM


Wright and Mead (1976) developed a method for estimating ability in the
Rasch model based upon an analysis of the residuals. Their method obtains an
initial estimate of ability, together with its associated standard error, from
the raw score, then calculates the residual of each item's response for that
person by subtracting from the response the probability of its being correct.
These residuals are standardized and a t statistic is calculated for the fit of
this person's response pattern. If this t is greater than some chosen value
(say, t = 2), then all items more than two logits above the person's initial
ability estimate are omitted from that person's test and a new ability estimate
is obtained based upon the shortened test. This process is repeated until an
acceptable t is achieved or until the test becomes too short to work with.

This estimation scheme (WIM) was also included in the tests reported in
this paper.1 The results with this method reflect only on the method as it was
received; there was no attempt to tune it by varying the critical t value. It
could be that its performance would improve with fine tuning.

Method 


The  Guessing  Model 

How the individual responses in a simulation are characterized is critically
important to its outcome. Certainly, if an estimator that matched the response
generator was built, that estimator should emerge as superior in any
competition. The validity of such investigations depends upon how the response
model matches reality. It was decided that a reasonable model for responding
has the following characteristics:

1. Need. A person guesses if he/she has a need to guess. This is a function
of the extent to which the item is more difficult than the person
is able. If people think they know the answer, they will not guess; if
they do not, they might.

2. Invitation. This is a function of the item, unrelated to its difficulty
(usually a function of the distractors). Some items invite
guessing; others discourage it.

3. Inclination. This is a function of people unrelated to ability. Some
people like to guess (risk takers?) and others do not (risk avoiders?).

4. Glitch. This represents something unexpected, which may be an
item-person interaction unrelated to ability, difficulty, inclination, or
invitation.


1The  subroutine  that  performs  WIM  estimation  was  written  by  Ronald  Mead. 
The guessing model is

    $$ P^*_{ij} = P_{ij} + (1 - P_{ij})\,(v_j c_i)/u_j \qquad [10] $$

where

$P^*_{ij}$ is the probability of person i getting item j correct;

$P_{ij}$ is the probability of person i getting item j correct based
upon the Rasch model given earlier (the need to guess arises
when $P_{ij}$ is small because $d_j$ is larger than $a_i$);

$v_j$ is the invitation to guess associated with item j ($0 \le v_j \le 1$);

$c_i$ is the inclination to guess associated with person i
($0 \le c_i \le 1$); and

$u_j$ is the number of alternatives for item j.


A response was first generated from this model. It was then allowed to remain
as generated with probability 1 - G (where G is the glitch factor) and was
changed with probability G. G is a generating parameter included to stir up
trouble and add noise.
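A response generator along these lines might look like the sketch below. It rests on an assumption that should be stated plainly: the probability of a correct response is taken to be P + (1 - P)vc/u, with invitation v and inclination c entering as a product, which is one reading of Equation 10. All function and parameter names are hypothetical, and the seed is arbitrary.

```python
import math
import random


def rasch_probability(ability, difficulty):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))


def response_probability(ability, difficulty, invitation, inclination, n_alternatives):
    """Assumed form of the guessing model: P + (1 - P) * v * c / u."""
    p = rasch_probability(ability, difficulty)
    return p + (1.0 - p) * invitation * inclination / n_alternatives


def simulate_responses(ability, difficulties, invitation, inclination,
                       n_alternatives, glitch, rng):
    """Generate one response vector, then flip each response with probability G."""
    responses = []
    for d in difficulties:
        p = response_probability(ability, d, invitation, inclination, n_alternatives)
        x = 1 if rng.random() < p else 0
        if rng.random() < glitch:
            x = 1 - x  # the glitch changes the generated response
        responses.append(x)
    return responses


rng = random.Random(1979)
difficulties = [-2.0 + 4.0 * k / 9.0 for k in range(10)]
print(simulate_responses(0.0, difficulties, 0.5, 0.5, 5, 0.1, rng))
```

With invitation or inclination set to zero the generator reduces to the pure Rasch model, and with a glitch of .5 every response is correct with probability one half regardless of ability.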


The  Simulation 


Independent  variables.  There  are  a  large  number  of  factors  to  be  varied  in 
a  simulation  in  order  to  obtain  a  complete  picture  of  what  is  happening.  This 
simulation  had  eight  factors  that  were  systematically  varied  and  on  which  all 
five  estimation  schemes  were  tried  out.  These  were: 

1. Difficulty distribution (3 levels). There were three distributions of
difficulties that were used: uniform, Gaussian, and bimodal. The bimodal
distribution was generated by constructing a uniform distribution
and leaving out the middle half.

2. Test length (3 levels). Tests of three lengths were simulated: short
(10 items), medium (20 items), and long (40 items). Longer tests were
not used because the generalizability of results would increase only
slightly but computer costs would multiply.

3. Test width (2 levels). Two test widths were simulated: narrow (2 logits)
and medium (4 logits).

4. Number of alternatives (2 levels). Tests with five choices were simulated,
since that reflects a common test format, as were tests with two
alternatives (true-false format), which represents an extreme case.

5. Ability (4 levels). Four levels of ability were used: Very Low, Low,
Medium, and High. Typically, Very Low was chosen as an ability that
was the same as the easiest item on the test. Medium was typically
chosen as zero, with Low halfway between them. High was usually
symmetric with Low. Therefore, with the difficulties shown previously,
the four abilities chosen would be -2, -1, 0, and +1. There was some
variation in this choice, which will be explained below.

6. Invitation to guess (3 levels). This ranged from Low (.0), to Medium
(.5), to High (.9).

7. Inclination to guess (3 levels). The same as Invitation. As is evident
from the response model, these two parameters are symmetric in
their effect; so only the six interesting combinations were used.

8. Glitch (3 levels). Glitch is meant to convey rare, or at most seldom,
trouble. Thus, three levels of glitch were used: none (.0), a little
(.1), and a lot (.4). Note that a glitch of .5 is maximum, in that it
will make the expected score for any response pattern the same (L/2).

Dependent variables. Two aspects of estimator performance were of interest.
The first is accuracy: How different is the estimate of ability obtained
from each estimator from the ability parameter that generated the response
vector? This has been summarized by the mean difference between estimated ability
for each estimator and the generating parameter. In the course of the
simulation the generating parameter was not always the appropriate standard,
because as a response vector was generated, it was checked to see if it was
estimable. In particular, if a response vector had a raw score of 1 or lower or
L - 1 or higher, it was not used, and another was generated. This resulted in a
truncation of the ability distribution. This truncation caused the low-ability
groups to have somewhat higher ability than the generating parameter would
indicate and the high-ability group to have slightly lower ability than the
generating parameter. To correct for this, the Rasch ability was estimated
without any noise for a particular simulation situation (a specific length,
width, distribution, and glitch) and the pure Rasch ability estimates were used
as the basis of comparison for that simulation. Hence, when there was no noise,
the Rasch estimates had zero bias by construction.


The  second  aspect  of  estimator  performance  that  was  of  interest  was  the 
variance  of  the  sampling  distribution  of  that  estimator  around  its  own  mean.  Of 
course,  the  smaller  this  was,  the  better  the  estimator. 

These two measures of estimator performance were combined into a total
variance figure by adding the weighted squared bias (analogous to the
between-sum-of-squares) to the sampling variance (the within-sum-of-squares),
using the usual synthesis of variance weightings. This represented the overall
efficiency of each estimator. The estimator having the smallest total variance
for that sample was then found, and that smallest value was divided by each
estimator's total variance to obtain relative efficiency. It is this figure
that will be reported.
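The relative efficiency figure can be illustrated as follows. The weighting is not spelled out in the text, so the sketch assumes the simplest form, total variance equal to squared bias plus sampling variance (the mean squared error about the comparison value). The function names and the two sets of estimates are hypothetical.

```python
from statistics import mean, pvariance


def total_variance(estimates, target):
    """Squared bias plus sampling variance (assumed equal weighting)."""
    bias = mean(estimates) - target
    return bias ** 2 + pvariance(estimates)


def relative_efficiencies(estimates_by_method, target):
    """Smallest total variance divided by each method's total variance."""
    totals = {name: total_variance(est, target)
              for name, est in estimates_by_method.items()}
    best = min(totals.values())
    return {name: (1.0 if t == 0 else best / t) for name, t in totals.items()}


# Hypothetical replications for two estimators with a comparison value of 1.0.
replications = {
    "A": [0.9, 1.1],   # unbiased, small spread
    "B": [1.4, 1.6],   # same spread, biased upward by 0.5
}
print(relative_efficiencies(replications, target=1.0))
```

The best estimator in a cell always receives 1.0, and every other estimator receives a value between 0 and 1, as in the tables of relative efficiencies.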

Results  and  Discussion 


Obviously, with a design consisting of almost 4,000 cells and 5 estimators
per cell, it would be impractical to attempt to present all the results.
Instead, selected findings representative of the main effects will be presented,
and some important interactions and trends will be discussed. The principal
result was that one method was superior: the AMT-Jackknife. The AMT-Jackknife
was superior, not because it was the most bias-free (although it did reasonably
well in that regard), but rather because of its extremely small sampling
variance.


No  Noise 


Before discussing the noisy simulations, the uncontaminated situation will
be considered. It would seem that any estimation scheme proposed must do
reasonably well in this situation before it can be considered a viable alternative
to ordinary methods.


Table 2
Relative Efficiencies of Five Estimators on Tests of Various Lengths and Widths
for Four Ability Levels (Very Low, Low, Medium, High), with
Guessing Invitation = 0, Guessing Inclination = 0, Glitch = 0,
for Five-Choice Items with a Uniform Distribution of Item Difficulties

                       10 Items              20 Items              40 Items
Width and         Very                  Very                  Very
Estimator          Low  Low Med. High    Low  Low Med. High    Low  Low Med. High

2 Logits
  Rasch             .7   .7   .7   .7     .8   .8   .9   .8     .9   .9   .9   .9
  Traditional       .2   .1   .2   .2     .1   .1   .2   .4     .0   .1   .2   .3
  Jackknife        1.0  1.0  1.0  1.0    1.0  1.0  1.0  1.0    1.0  1.0  1.0  1.0
  AMT-Jackknife    1.0  1.0  1.0  1.0    1.0  1.0  1.0  1.0    1.0  1.0  1.0  1.0
  WIM               .7   .7   .7   .7     .8   .9   .9   .8     .9   .9   .9   .9

4 Logits
  Rasch             .8   .7   .7   .8     .8   .8   .8   .8     .9   .9   .9   .9
  Traditional       .2   .2   .2   .3     .1   .1   .2   .4     .0   .0   .2   .3
  Jackknife        1.0   .9   .9  1.0    1.0   .9   .9   .9    1.0  1.0   .9   .9
  AMT-Jackknife     .9  1.0  1.0  1.0    1.0  1.0  1.0  1.0    1.0  1.0  1.0  1.0
  WIM               .7   .7   .6   .7     .7   .7   .8   .7     .7   .9   .9   .9


Table 2 shows the relative efficiencies (to 1 decimal place) of the 5
estimators for 3 test lengths, 2 different widths, and 4 abilities. The results for
a uniform distribution of difficulties were striking for two reasons. First,
they demonstrate the superiority of the AMT-Jackknife (followed closely by the
standard Jackknife), assuring that the Jackknife is a viable scheme. Secondly,
the Rasch maximum likelihood estimator was not the most efficient. This
counters expectation, since maximum likelihood is supposed to yield estimates with
minimum variance. Why did that fail to happen in this case? The answer is that
the properties of maximum likelihood estimators are asymptotic. As test length
increased from 10 to 40 items, the relative efficiency of the Rasch estimator
increased from about 70% to 90%. The WIM estimator behaved in the same way. It
would seem that 40 items is not enough for asymptotic properties to perform
better than Jackknife properties. This finding suggests that the use of maximum
likelihood estimators with short tests, without further thought, should be
reconsidered. Replacing maximum likelihood with the AMT-Jackknife may benefit
short test applications. The authors are not the first to observe that maximum
likelihood does not accomplish everything desired from efficient estimation.
Lewis (1970), in studying methods for the estimation of thresholds of
sensitivity curves (a problem similar to the one being examined), found that
maximum likelihood was unsatisfactory and used instead a scheme based on order
statistics.


Table 3
Relative Efficiencies of Five Estimators on Tests of Various Lengths and Widths
for Four Ability Levels (Very Low, Low, Medium, High), with
Guessing Invitation = 0, Guessing Inclination = 0, Glitch = 0,
for Five-Choice Items with a Gaussian Distribution of Item Difficulties

                       10 Items              20 Items              40 Items
Width and         Very                  Very                  Very
Estimator          Low  Low Med. High    Low  Low Med. High    Low  Low Med. High

2 Logits
  Rasch             .7   .7   .7   .7     .8   .9   .8   .8     .9   .9   .9   .9
  Traditional       .2   .1   .2   .2     .1   .1   .2   .3     .0   .1   .2   .3
  Jackknife        1.0  1.0   .9   .9    1.0  1.0   .9  1.0    1.0  1.0  1.0  1.0
  AMT-Jackknife    1.0  1.0  1.0  1.0    1.0  1.0  1.0  1.0    1.0  1.0  1.0  1.0
  WIM               .7   .7   .7   .7     .8   .8   .8   .8     .9   .9   .9   .9

4 Logits
  Rasch             .8   .7   .7   .7     .8   .8   .8   .8     .8   .9   .8   .8
  Traditional       .1   .2   .2   .3     .1   .1   .2   .4     .0   .0   .2   .3
  Jackknife        1.0  1.0   .8   .9    1.0   .9   .9   .9    1.0  1.0   .9   .9
  AMT-Jackknife     .8  1.0  1.0  1.0     .9  1.0  1.0  1.0    1.0  1.0  1.0  1.0
  WIM               .7   .7   .6   .6     .8   .6   .8   .7     .9   .8   .9   .8


Table 3 shows that the same structure observed for a uniform distribution
held for a Gaussian distribution. Once again, the AMT-Jackknife was superior,
followed closely by the standard Jackknife, and then by Rasch and WIM. In all
situations the Traditional guessing correction performed poorly. This was not
unanticipated, since corrections are being made for a disturbance that is
totally absent. As will be seen later, the performance of the Traditional estimator
improved when guessing did occur (not surprisingly). Incidentally, WIM, which
is the most computationally expensive procedure, is especially expensive for
Gaussian and bimodal distributions of difficulty. More iterations are required
for convergence in these situations than when the difficulties are uniform.

Table 4 shows the efficiencies for a bimodal distribution, evidencing
essentially the same structure that appeared with the other two distributions. WIM
estimates were not obtained for a 40-item test (Width 2) when the procedure had
not converged after 100 seconds (on an Amdahl/V6). It was felt that any
information obtained from such a result would not be worth the cost or effort.

One conclusion is clear: When there is no guessing, the maximum likelihood
estimator of ability in the Rasch model can be improved for tests of modest
length (less than 40 items or so). In this noiseless situation there is little
to choose between the robust AMT-Jackknife and the standard Jackknife. The
AMT was somewhat better but used a little more effort in its computation. It
was also found that the Traditional correction for guessing, if applied when
guessing is absent, can have disastrous effects upon the efficiency of
estimation. WIM worked as well as straight Rasch estimation when there was no
guessing, although it did lead to some shrinkage due to the shortening of tests
when unusual residuals occurred by chance.

Table 4
Relative Efficiencies of Five Estimators on Tests of Various Lengths and Widths
for Four Ability Levels (Very Low, Low, Medium, High), with
Guessing Invitation = 0, Guessing Inclination = 0, Glitch = 0,
for Five-Choice Items with a Bimodal Distribution of Item Difficulties

                       10 Items              20 Items              40 Items
Width and         Very                  Very                  Very
Estimator          Low  Low Med. High    Low  Low Med. High    Low  Low Med. High

2 Logits
  Rasch             .6   .6   .5   .6     .8   .7   .6   .6     .9   .9   .6   .9
  Traditional       .1   .1   .1   .2     .1   .1   .1   .2     .0   .1   .1   .2
  Jackknife         .8   .7   .6   .8    1.0   .7   .6   .7    1.0  1.0   .6  1.0
  AMT-Jackknife    1.0  1.0  1.0  1.0     .8  1.0  1.0  1.0     .8  1.0  1.0  1.0
  WIM               .6   .6   .5   .6     .8   .6   .6   .6     .9   .9   .6   .9

4 Logits
  Rasch             .8   .6   .2   .8     .8   .9   .2   .9     .9  1.0   .2   .9
  Traditional       .2   .1   .0   .2     .1   .1   .0   .3     .0   .1   .0   .2
  Jackknife        1.0   .7   .2   .9    1.0  1.0   .2  1.0    1.0  1.0   .2  1.0
  AMT-Jackknife     .6  1.0  1.0  1.0     .3   .7  1.0   .9     .2   .4  1.0   .5
  WIM               .7   .6   .2   .7     .8   .8   .2   .8      *    *    *    *

* WIM estimates were not obtained (the procedure did not converge).

Some  Guessing 


The next step in the exploration of estimators of ability was to study
their behavior with a small amount of noise. Tables 5, 6, and 7 show the
relative efficiencies for the three distributions with guessing invitations and
guessing inclinations set at .5. Even a cursory examination shows that the
structure observed in the no-noise situation still obtained. The AMT-Jackknife
and the standard Jackknife were still superior, but the WIM and the Traditional
corrections improved. The bimodal distribution seemed to trouble the Jackknife
more than its robustified version; however, both seemed to do satisfactorily.

As would be suspected, at lower ability levels, schemes designed to deal with
guessing (WIM and Traditional) worked to their best advantage. At higher
ability levels, this was not the case. Jackknifing schemes did better on narrow
tests than on wide ones, an observation that has been confirmed by examining
their behavior on very wide tests of six to eight logits and noting a
deterioration of performance; this was especially marked on eight-logit-wide
tests for the AMT.

The conclusions reached for noiseless data still hold, but less strongly.
The two Jackknife methods remain the methods of choice, especially for
individuals above mean ability. But as the data become increasingly noisy, each
estimator reacted in its own way. The Rasch estimator yielded the same score for
Table 5
Relative Efficiencies of Five Estimators on Tests of Various Lengths and Widths
for Four Ability Levels (Very Low, Low, Medium, High), with
Guessing Invitation = .5, Guessing Inclination = .5, Glitch = 0,
for Five-Choice Items with a Uniform Distribution of Item Difficulties

                       10 Items              20 Items              40 Items
Width and         Very                  Very                  Very
Estimator          Low  Low Med. High    Low  Low Med. High    Low  Low Med. High

2 Logits
  Rasch             .8   .8   .7   .6    1.0   .9   .9   .8    1.0  1.0   .9   .8
  Traditional       .2   .3   .3   .4     .4   .5   .5   .5     .5  1.0   .7  1.0
  Jackknife        1.0  1.0  1.0  1.0    1.0  1.0  1.0  1.0    1.0  1.0  1.0   .9
  AMT-Jackknife    1.0  1.0  1.0  1.0    1.0  1.0  1.0  1.0    1.0  1.0  1.0  1.0
  WIM               .8   .8   .7   .6    1.0   .9   .9   .8    1.0  1.0   .9   .8

4 Logits
  Rasch             .9   .8   .7   .6     .9  1.0   .8   .8     .8  1.0   .9   .7
  Traditional       .4   .5   .3   .4     .6   .6   .7   .5     .4  1.0   .7   .8
  Jackknife        1.0   .9   .8   .9     .9  1.0   .9   .8     .8  1.0   .9   .8
  AMT-Jackknife    1.0  1.0  1.0  1.0     .9  1.0  1.0  1.0     .8   .9  1.0  1.0
  WIM               .8   .8   .6   .5    1.0  1.0   .8   .6    1.0  1.0   .9   .7


Table 6
Relative Efficiencies of Five Estimators on Tests of Various Lengths and Widths
for Four Ability Levels (Very Low, Low, Medium, High), with
Guessing Invitation = .5, Guessing Inclination = .5, Glitch = 0,
for Five-Choice Items with a Gaussian Distribution of Item Difficulties

                       10 Items              20 Items              40 Items
Width and         Very                  Very                  Very
Estimator          Low  Low Med. High    Low  Low Med. High    Low  Low Med. High

2 Logits
  Rasch             .8   .8   .7   .6    1.0   .9   .8   .7    1.0   .9   .8   .8
  Traditional       .3   .3   .4   .4     .4   .4   .6   .5     .5  1.0   .7  1.0
  Jackknife        1.0  1.0  1.0   .9    1.0  1.0   .9   .9    1.0   .9   .9   .8
  AMT-Jackknife    1.0  1.0  1.0  1.0    1.0  1.0  1.0  1.0    1.0  1.0  1.0  1.0
  WIM               .8   .8   .7   .6    1.0   .9   .8   .7    1.0   .9   .9   .8

4 Logits
  Rasch            1.0   .9   .7   .5    1.0  1.0   .7   .6    1.0   .8   .8   .8
  Traditional       .4   .5   .4   .3     .6   .7   .6   .4     .5  1.0   .7   .8
  Jackknife        1.0  1.0   .9   .8     .9  1.0   .8   .8    1.0   .8   .8   .8
  AMT-Jackknife     .9  1.0  1.0  1.0     .9  1.0  1.0  1.0     .9   .8  1.0  1.0
  WIM               .8   .8   .7   .5    1.0  1.0   .7   .6    1.0   .9   .9   .8

Table 7

Relative Efficiencies of Five Estimators on Tests of Various Lengths and Widths
for Four Ability Levels (Very Low, Low, Medium, High), with
Guessing Invitation = .5, Guessing Inclination = .5, Glitch = 0,
for Five-Choice Items with a Bimodal Distribution of Item Difficulties

                                                  Test Length
                       10 Items                  20 Items                  40 Items
Width and        Very                      Very                      Very
Estimator         Low   Low  Med.  High     Low   Low  Med.  High     Low   Low  Med.  High
2 Logits
  Rasch            .8    .6    .5    .5     1.0    .8    .6    .5     1.0    .8    .6    .4
  Traditional      .3    .3    .3    .2      .6    .6    .4    .3      .4   1.0    .4    .6
  Jackknife       1.0    .7    .7    .7     1.0    .8    .6    .5     1.0    .8    .6    .5
  AMT-Jackknife   1.0   1.0   1.0   1.0      .9   1.0   1.0   1.0      .8    .8   1.0   1.0
  WIM              .8    .6    .5    .5      .9    .8    .6    .5     1.0    .8    .6    .4
4 Logits
  Rasch            .9    .6    .2    .7      .7   1.0    .2    .8     1.0    .8    .2    .5
  Traditional      .4    .4    .1    .4      .4    .8    .2    .5      .6   1.0    .1    .6
  Jackknife       1.0    .6    .2    .9      .7   1.0    .2   1.0     1.0    .8    .2    .5
  AMT-Jackknife    .6   1.0   1.0   1.0      .4    .9   1.0   1.0      .4    .6   1.0   1.0
  WIM              .8    .4    .2    .5     1.0   1.0    .2    .7       *     *     *     *

all raw scores of the same value, regardless of how that raw score was obtained,
but yielded a poor goodness-of-fit statistic for misfitting persons. WIM reacted
by shortening the test, indicating in essence that only a small portion of
the test response vector obeys the Rasch model. The Jackknife methods reacted
by regressing the scores toward zero (increasing bias but reducing variance of
the sampling distribution) while increasing the standard error, thus signifying
that the information on the individual was small.

More  Guessing 

Next, the same three distributions of item difficulty were considered, but
this time with a great deal of guessing. Tables 8, 9, and 10 show the results
when guessing invitation and inclination were both set to .9. This yielded a
situation in which a person guessed whenever he/she did not know the answer and
was identical to the situation posited in the derivation of the Traditional
guessing correction. In this situation it would be expected that the Traditional
method would excel; and it did perform well, but only when the test length
was great enough to overcome its small sample inefficiency.

Once again, the same pattern of results emerged. For short tests the Jackknifing
schemes worked best, with the edge always in the direction of the AMT.
As tests got longer (40 items), the Traditional guessing correction began to work
quite well. WIM, on the other hand, was disappointing, doing scarcely better
than just a straight Rasch estimate. This must be interpreted, however. WIM
reduces measurement bias quite well; but in doing so, it also decreases test
length substantially. It could be argued that the length of the test evaluated


Table 8

Relative Efficiencies of Five Estimators on Tests of Various Lengths and Widths
for Four Ability Levels (Very Low, Low, Medium, High), with
Guessing Invitation = .9, Guessing Inclination = .9, Glitch = 0,
for Five-Choice Items with a Uniform Distribution of Item Difficulties

                                                  Test Length
                       10 Items                  20 Items                  40 Items
Width and        Very                      Very                      Very
Estimator         Low   Low  Med.  High     Low   Low  Med.  High     Low   Low  Med.  High
2 Logits
  Rasch            .8    .8    .7    .6     1.0    .9    .8    .8      .7    .5    .8    .6
  Traditional      .4    .4    .4    .5      .7    .9   1.0    .8     1.0   1.0   1.0   1.0
  Jackknife       1.0   1.0   1.0   1.0     1.0   1.0    .9   1.0      .7    .5    .8    .7
  AMT-Jackknife   1.0   1.0   1.0   1.0     1.0   1.0    .9   1.0      .7    .5    .8    .7
  WIM              .8    .8    .7    .6     1.0    .9    .8    .8      .7    .5    .8    .6
4 Logits
  Rasch           1.0    .9    .7    .6      .9   1.0    .6    .6      .6    .5    .7    .7
  Traditional      .6    .8    .5    .4      .9    .9   1.0    .6     1.0   1.0   1.0   1.0
  Jackknife       1.0   1.0    .8    .9      .8   1.0    .7    .7      .6    .5    .7    .8
  AMT-Jackknife   1.0   1.0   1.0   1.0      .8   1.0    .8   1.0      .5    .5    .8   1.0
  WIM              .8    .8    .6    .5     1.0    .9    .6    .5      .7    .6    .7    .6

Table 9

Relative Efficiencies of Five Estimators on Tests of Various Lengths and Widths
for Four Ability Levels (Very Low, Low, Medium, High), with
Guessing Invitation = .9, Guessing Inclination = .9, Glitch = 0,
for Five-Choice Items with a Gaussian Distribution of Item Difficulties

                                                  Test Length
                       10 Items                  20 Items                  40 Items
Width and        Very                      Very                      Very
Estimator         Low   Low  Med.  High     Low   Low  Med.  High     Low   Low  Med.  High
2 Logits
  Rasch            .9    .8    .7    .5     1.0    .9    .8    .7      .7    .5    .7    .6
  Traditional      .5    .4    .4    .4      .6    .8    .9    .7     1.0   1.0   1.0   1.0
  Jackknife       1.0   1.0    .9    .9     1.0   1.0    .9    .9      .7    .5    .8    .7
  AMT-Jackknife   1.0   1.0   1.0   1.0     1.0   1.0   1.0   1.0      .7    .6    .8    .8
  WIM              .9    .8    .7    .5     1.0    .9    .8    .7      .7    .5    .7    .6
4 Logits
  Rasch           1.0    .9    .6    .4     1.0   1.0    .6    .5      .6    .5    .6    .6
  Traditional      .6    .9    .5    .4     1.0    .8   1.0    .5     1.0   1.0   1.0    .8
  Jackknife       1.0   1.0    .8    .8      .9   1.0    .7    .7      .6    .5    .7    .6
  AMT-Jackknife   1.0   1.0   1.0   1.0      .9   1.0    .8   1.0      .6    .5    .8   1.0
  WIM             1.0    .8    .5    .4     1.0    .9    .6    .4     1.0    .9    .7    .5
by WIM, after eliminating items with large residuals, corresponds to the test
that the testee actually took. However, the reduced test length has the concomitant
effect of increasing the standard error of measurement, and this causes
its disappointing showing in the efficiency statistic.

Table 10

Relative Efficiencies of Five Estimators on Tests of Various Lengths and Widths
for Four Ability Levels (Very Low, Low, Medium, High), with
Guessing Invitation = .9, Guessing Inclination = .9, Glitch = 0,
for Five-Choice Items with a Bimodal Distribution of Item Difficulties

                                                  Test Length
                       10 Items                  20 Items                  40 Items
Width and        Very                      Very                      Very
Estimator         Low   Low  Med.  High     Low   Low  Med.  High     Low   Low  Med.  High
2 Logits
  Rasch            .7    .6    .5    .4     1.0    .9    .6    .5      .6    .5    .6    .5
  Traditional      .4    .4    .2    .4      .8    .8    .8    .4     1.0   1.0   1.0    .9
  Jackknife        .9    .7    .6    .6     1.0    .9    .7    .6      .6    .5    .7    .6
  AMT-Jackknife   1.0   1.0   1.0   1.0     1.0   1.0   1.0   1.0      .6    .6   1.0   1.0
  WIM              .7    .6    .5    .4      .9    .9    .6    .5      .6    .5    .6    .5
4 Logits
  Rasch           1.0    .6    .2    .4      .6    .9    .2    .5      .3    .4    .2    .3
  Traditional      .9    .5    .2    .4      .7   1.0    .4    .6     1.0   1.0    .4    .5
  Jackknife       1.0    .6    .3    .6      .6    .9    .2    .6      .5    .4    .2    .3
  AMT-Jackknife    .9   1.0   1.0   1.0      .4   1.0   1.0   1.0      .2    .4   1.0   1.0
  WIM             1.0    .5    .2    .4     1.0    .9    .2    .5       *     *     *     *

Guessing  plus  Glitching 


Since the distribution of difficulties did not appear to have much effect
on the behavior of the various estimators, the remainder of the results reported
will be confined to one or the other of the distributions, with only side comments
if the results differ substantially when another distribution was used.
(Incidentally, for an extremely bimodal distribution in which all items are
piled up at the extremes, the AMT will not work at all.)

Table 11 shows the reaction of the various estimators to a glitch of .1 over
several test widths and for different amounts of guessing. There were no surprises.
The deterioration of performance of the Jackknifing estimators with
increased test width is visible but not severe. The AMT-Jackknife was always
superior to the standard Jackknife. Under all conditions, Jackknifing seemed to
be the best choice for higher ability individuals. Jackknifing also works rather
well for correcting guessers, but the other methods may be better. Results are
reported in this table only for a test length of 20, but these are representative
of the general findings. The Jackknifing methods did relatively less
well with a test length of 40 and relatively better with a test length of 10.


Table 11

Relative Efficiencies of Various Estimators of Ability for a Test with 20 Items
Whose Difficulties are Uniformly Distributed
for Four Ability Levels (Very Low, Low, Medium, High),
with a Random Noise Component of 10% (Glitch = .1)
(100 Entries Sampled Per Cell in Design)

                                                  Test Width
Amount of              2 Logits                  4 Logits                  6 Logits
Guessing (V,C)   Very                      Very                      Very
and Estimator     Low   Low  Med.  High     Low   Low  Med.  High     Low   Low  Med.  High
(0,0)
  Rasch           1.0    .9    .9    .8     1.0    .8    .8    .9      .7    .9    .5    .9
  Traditional      .2    .1    .2    .3      .4    .1    .2    .3      .8    .2    .1    .3
  Jackknife       1.0   1.0   1.0   1.0      .9    .9    .9   1.0      .6   1.0    .6   1.0
  AMT-Jackknife   1.0   1.0   1.0   1.0      .9   1.0   1.0   1.0      .4    .9   1.0    .9
  WIM             1.0    .9    .9    .8     1.0    .6    .8    .8     1.0    .7    .3    .7
(.5,.5)
  Rasch           1.0    .9    .9    .8      .6   1.0    .8    .8      .4    .9    .6    .8
  Traditional      .9    .5    .6    .4     1.0    .6    .6    .3     1.0    .7    .4    .4
  Jackknife       1.0   1.0   1.0   1.0      .6   1.0    .9    .9      .4    .9    .6   1.0
  AMT-Jackknife   1.0   1.0   1.0   1.0      .6   1.0   1.0   1.0      .3    .8   1.0   1.0
  WIM             1.0    .9    .9    .8      .7   1.0    .8    .6      .8   1.0    .5    .6
(.9,.9)
  Rasch           1.0    .9    .9    .8      .7    .9    .8    .7      .3    .7    .6    .9
  Traditional      .8   1.0    .7    .5     1.0   1.0    .7    .4     1.0   1.0    .5    .3
  Jackknife       1.0   1.0   1.0   1.0      .7    .9    .9    .9      .3    .7    .6   1.0
  AMT-Jackknife   1.0   1.0   1.0   1.0      .6   1.0   1.0   1.0      .3    .7   1.0    .8
  WIM             1.0    .9    .9    .7      .8   1.0    .8    .8      .7    .9    .4    .6


True/False  Tests 


When  the  number  of  alternatives  was  reduced  from  five  to  two,  much  the  same 
results  were  found.  With  no  guessing  the  Jackknifing  methods  did  best,  with  an 
edge  to  the  AMT.  As  guessing  became  increasingly  prevalent,  the  Traditional 
correction  scheme  worked  better.  It  was  still  found,  however,  that  for  high 
abilities  the  AMT  method  was  superior  in  efficiency  to  all  others. 

Standard  Errors 


The  Rasch  model  standard  error  is 

    \mathrm{Rasch\ (SE)} = 1 \Big/ \Big\{ \sum_{j} \big[ P_{ij} (1 - P_{ij}) \big] \Big\}^{1/2}    [11]

for each ability level i. This accurately reflects what was observed empirically
for the Rasch ability estimates in the simulations. When there was no guessing,
the standard deviations of the sampling distributions were about what this
equation would predict. It underpredicted the variability observed when there
was noise. The WIM standard error is calculated in the same way as the Rasch
except for a test of reduced length. This seems to accurately reflect reality
for the situations tested.
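As a rough illustration of the Rasch standard error above, the following Python sketch evaluates it for tests of 10, 20, and 40 items. The helper names and the evenly spaced difficulty layout are assumptions made for this example only; they are not the simulation design used in the study.

```python
import math

def rasch_prob(theta, b):
    """Rasch probability of a correct response (ability theta, difficulty b, in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def rasch_se(theta, difficulties):
    """Model standard error: 1 / sqrt(sum over items of P(1 - P))."""
    info = sum(rasch_prob(theta, b) * (1.0 - rasch_prob(theta, b))
               for b in difficulties)
    return 1.0 / math.sqrt(info)

def evenly_spaced_difficulties(n_items, width):
    """Hypothetical layout: n_items difficulties spread evenly over a given width."""
    if n_items == 1:
        return [0.0]
    step = width / (n_items - 1)
    return [-width / 2.0 + k * step for k in range(n_items)]

# Standard error at theta = 0 for a 2-logit-wide test of each length.
for n in (10, 20, 40):
    print(n, round(rasch_se(0.0, evenly_spaced_difficulties(n, 2.0)), 3))
```

Under the same assumptions, a WIM-style standard error would be obtained by passing only the retained items to `rasch_se`, which is why shortening the test enlarges the standard error.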


The Jackknife standard error is calculated directly from the pseudovalues
by

    \mathrm{Jackknife\ (SE)} = \Big[ \sum_{j} (a_j^{*} - a^{*})^{2} \Big/ \big( (L - 1) L \big) \Big]^{1/2}    [12]


and  is  known  to  be  a  conservative  estimator.  This  is  certainly  true  in  this 
case.  It  tended  to  overestimate  the  actual  standard  error  by  about  50%  for  test 
lengths  of  10,  by  25%  for  test  lengths  of  20,  but  was  just  about  right  for  test 
lengths  of  40. 
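The pseudovalue calculation can be sketched in Python as follows. This is a minimal, generic illustration that assumes the usual leave-one-out definition of a pseudovalue, L times the full-sample estimate minus (L - 1) times the estimate with observation j removed; the ability estimator used in the study is not reproduced here, and a simple mean stands in for it.

```python
import math

def jackknife_se(data, estimator):
    """Return (jackknife estimate, jackknife standard error) for `estimator`."""
    L = len(data)
    full = estimator(data)
    pseudovalues = []
    for j in range(L):
        reduced = data[:j] + data[j + 1:]          # leave observation j out
        pseudovalues.append(L * full - (L - 1) * estimator(reduced))
    mean_pseudo = sum(pseudovalues) / L
    sum_sq = sum((p - mean_pseudo) ** 2 for p in pseudovalues)
    return mean_pseudo, math.sqrt(sum_sq / ((L - 1) * L))

def mean(values):
    """Stand-in estimator for the example (an assumption, not the paper's estimator)."""
    return sum(values) / len(values)

estimate, se = jackknife_se([1.0, 2.0, 3.0, 4.0, 5.0], mean)
print(estimate, se)
```

For the mean, the pseudovalues equal the observations themselves, so the jackknife standard error reduces to the familiar standard error of the mean; this makes the example easy to check by hand.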

Although  there  are  several  candidates  for  estimating  the  standard  error  of 
the  AMT,  the  investigations  of  the  authors  are  insufficient  to  be  able  to  recom¬ 
mend  one  at  this  time.  It  seems  reasonable  to  use  the  corrected  Jackknife  stan¬ 
dard  error  until  a  better  choice  is  found.  The  Jackknife  standard  error  will 
almost  certainly  be  conservatively  large. 

Conclusions 


This investigation sought to find and test alternative methods for estimating
ability under the Rasch model in the face of plausible noise. This was done
by using some recent developments in robust estimation without adding parameters
to the model, thus retaining the Rasch model's attractive attributes. It was
found that gains in recovering abilities in the presence of guessing and untoward
responses of other kinds can be obtained through the use of a robustified
Jackknife. But it was also found that specially developed models aimed at the
lower end of the ability continuum may be able to accomplish this better than
these general tools. WIM worked when there was guessing and aided in increasing
the accuracy of estimation for low-ability testees. The Traditional method
worked when there was much guessing, the test was long, and the ability of the
testees was low.

A surprising finding was that for short tests of 10 or 20 items, the Jackknife
estimators, with a significant edge to the AMT version, yielded better
estimates of ability than the maximum likelihood estimator, even when preconditions
for the Rasch model held. This increase in efficiency of estimation is
especially important for those applications of latent trait models that use a
limited number of measures obtained about a person as a de facto test (see,
e.g., the analysis of parole data in Perline, Wright, & Wainer, 1979). In these
circumstances the number of items cannot be easily increased, and the only alternative
is to improve the estimate of ability through other means. Thissen
(1976) attempted to do this by using a method Bock (1972) developed for wrong
answers, but this is very expensive computationally and only applicable to multiple-choice
items. Super-efficient estimators may also be useful in such applications
as adaptive testing.
The simulations performed were very extensive; nevertheless, considerably
more research is necessary. A careful study of estimators of standard error is
critical, as are the distributional properties of the Jackknifed estimators.
Robust estimators have not been used in conjunction with the Jackknife before,
so nothing is known about that distribution. The authors believe that Jackknife
estimates are t-distributed (although there is difficulty in determining the
effective degrees of freedom). It seems reasonable, therefore, to suppose that
the robust Jackknife will have a similar symmetric (albeit tighter) distribution.
This suggests that the Jackknife estimates of standard error for the AMT
estimator are conservative. Just how conservative these actually are, however,
awaits further investigation.

A second area of investigation that is still incomplete is goodness-of-fit
tests. Substituting robust estimates of ability into the usual goodness-of-fit
equations should yield a conservative estimate more realistic than those usually
obtained (which benefit from capitalization on chance). But it is not known to
what extent the asymptotic properties of such fit statistics derived and/or described
by Anderson (1973), Fischer (1974), Martin-Lof (1974), and Wright and
Stone (1979) apply.

The  finding  of  improved  estimation  efficiency  is  an  intriguing  one.  Lewis 
(1970)  pointed  out  that  although  maximum  likelihood  estimates  of  location  param¬ 
eters  of  ogive  functions  are  asymptotically  identical  to  minimum  chi-square  es¬ 
timates,  they  can  be  quite  different  for  small  samples.  Neither  makes  any 
claims  for  small  sample  efficacy,  but  what  is  surprising  is  how  large  "small" 
can be and how much of an improvement can be made using an alternative procedure.
Lewis found that asymptotically optimal procedures did especially poorly
in estimating accurate confidence intervals around the location parameter. Perhaps
this, too, is an area in which the AMT-Jackknife will prove useful. The
questions  are  clear  and  important,  and  the  methodology  for  answering  them  is 
straightforward. 

There  are  a  number  of  other  estimators  that  may  improve  performance  still 
more.  For  example,  Ramsay  (1977)  found  that  the  Eg  estimator  has  some  advan¬ 
tages  over  the  AMT.  Novick  (1979)  has  suggested  several  Bayesian  estimators 
that  may  have  promise. 

The main finding of this study is that for short tests the asymptotic properties
of maximum likelihood estimators are not fully realized. Other methods
increase efficiency. In addition, these other estimators can correct for noise
in the data, such as guessing, and thus can increase validity. The AMT-Jackknife
may not be the best estimator of its type that can be derived. Perhaps
other variations on this theme can go even further in the direction of super-efficiency.
Nevertheless, the AMT-Jackknife does seem to deal well with the problem
of guessing, which is so poorly handled by estimation of a lower asymptote
of the item characteristic curve.
In a large test administration a few examinees may be so unlike other examinees
that their multiple-choice aptitude test scores have limited value as
ability measures. A particularly transparent example is provided by a hypothetical
low-ability copier who copies half of his/her answers from a much more able
neighbor. Other test anomalies include:

1.  Improperly  coached  examinees  who  are  shown  the  answers  to  some  items 
before  the  exam  begins, 

2. Examinees with high ability but atypical schooling or low English
fluency,

3.  Exceptionally  creative  examinees  who  discover  novel  interpretations  for 
some  items, 

4.  Examinees  who  are  very  conservative  in  their  use  of  partial  informa¬ 
tion,  and 

5.  Examinees  who  make  alignment  errors  on  their  answer  sheet  over  a  block 
of  items,  answering  the  9th  item  in  the  10th  place,  the  10th  item  in 
the  11th  place,  and  so  forth. 

In  each  of  these  cases  it  can  be  argued  that  (1)  the  test  score  is  not  an  appro¬ 
priate  measure  of  ability  and  (2)  the  item-by-item  pattern  of  answers  may  be 
recognizably  unusual.  For  example,  the  hypothetical  low-ability  copier  seems 
likely  to  have  many  easy  items  incorrect  and  many  difficult  items  correct,  rela¬ 
tive  to  typical  examinees.  A  second  example  of  an  inappropriate  test  score  and 
an  unusual  answer  pattern  occurring  together  is  provided  by  the  hypothetical 
examinee  with  an  alignment  error.  He  or  she  will  most  likely  have  a  block  of 
consecutive  items  incorrect  and  an  unusual  answer  pattern  within  the  block. 

Thus,  a  multiple-choice  aptitude  test  may  be  a  dubious  measure  of  ability 
due  to  any  one  of  many  (although  possibly  rare)  causes.  In  at  least  some  cases, 
the  item-by-item  response  pattern  may  contain  evidence  of  this  fact.  This  paper 
considers the problem of using answer patterns to recognize inappropriate test
scores.
Appropriateness Measurement: Its Objectives and Limitations

Appropriateness measurement is a general approach to the problem caused by
inappropriate test scores. Its purpose is simply to identify inappropriate test
scores. It is limited to cases, such as those noted above, in which inappropriate
test scores and unusual answer patterns tend to co-occur. Appropriateness
measurement is implemented by statistics, called appropriateness indices, that
measure the degree to which an examinee's answer pattern is "unusual," i.e.,
unlike the pattern expected from typical examinees.

In appropriateness measurement studies, examinees are sorted into two
groups: (1) examinees with very unusual answer patterns, as indicated by very
extreme index values and (2) other examinees, i.e., examinees with typical index
values. Appropriateness measurement is successful to the extent that the group
of examinees with extreme index values has a larger proportion of examinees with
inappropriate scores than the group with typical index values.
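The success criterion just described can be made concrete with a small Python sketch. The index values, the aberrance flags, and the cutoff below are invented for illustration, and the sketch assumes that lower index values mean more unusual answer patterns.

```python
def split_by_index(index_values, is_inappropriate, cutoff):
    """Split examinees at `cutoff`; return the proportion of inappropriate
    scores in the extreme group and in the typical group."""
    extreme = [bad for x, bad in zip(index_values, is_inappropriate) if x < cutoff]
    typical = [bad for x, bad in zip(index_values, is_inappropriate) if x >= cutoff]

    def rate(group):
        return sum(group) / len(group) if group else float("nan")

    return rate(extreme), rate(typical)

# Hypothetical index values and known-aberrant flags for eight examinees.
index_values = [-9.1, -8.7, -4.2, -3.9, -3.5, -3.1, -2.8, -2.6]
is_inappropriate = [True, True, False, True, False, False, False, False]

extreme_rate, typical_rate = split_by_index(index_values, is_inappropriate, cutoff=-5.0)
print(extreme_rate, typical_rate)
```

In this made-up example the extreme group contains proportionally more inappropriate scores than the typical group, which is the sense in which appropriateness measurement would be called successful.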

Background 


The first large-scale appropriateness measurement study was reported by
Levine and Rubin (in press). This study, reviewed below, provides the background
and context for the theoretical developments and empirical results reported
in this paper.1

Levine  and  Rubin  (in  press)  identified  three  types  of  appropriateness  in¬ 
dices  and  reported  positive  empirical  findings  with  these  indices.  However,  the 
generality  of  their  findings  is  limited  by  properties  of  their  data  set.  In 
particular,  their  data  were  simulated,  the  simulation  parameters  were  available 
for  use  in  defining  appropriateness  indices,  and  aberrant  examinees  were  un¬ 
equivocally  identified. 

In  this  paper  actual  and  simulated  data  are  used  to  attack  three  problems 
raised  by  the  Levine  and  Rubin  study,  namely: 

1. Estimated versus known item parameters. With simulated data, item parameters
(e.g., item difficulties) are known and need not be estimated
prior to computing appropriateness indices. With actual data, parameters
must be estimated. How seriously will appropriateness measurement
be affected by estimation errors?

2. Unidentified aberrants. In a simulation study, atypical examinees are
unequivocally identified and a sample of truly normal examinees is
available for item parameter estimation. With actual data an unknown
proportion of unidentified aberrants will be included in each large
sample of nominally normal examinees. How will the presence of these
aberrants affect parameter estimation and, consequently, appropriateness
measurement?


1 For independent contemporary work that appears similar in conception, see Flier
(1977). For possibly related research based on classical test theory, see
Ghiselli (1956; 1960a; 1960b) and Donlon and Fischer (1968).
3. Model validity. Simulated data conform precisely to the psychometric
model used to generate data and to formulate appropriateness indices.
There will be reliable contradictions, however, to the assumptions of
any tractable, nontrivial psychometric model in a large sample of actual
data. Are currently available psychometric models sufficiently valid
to support appropriateness measurement with actual data?

The results presented will show (1) that appropriateness indices are not
seriously degraded by the use of estimated parameters; (2) that appropriateness
indices are not seriously degraded even when a relatively large proportion of
aberrant examinees is initially (and improperly) treated as normal; and (3) that
detection rates with actual test data are comparable to detection rates with
simulated data.


Review  of  Test  Theories 
and  Basic  Appropriateness  Measurement 

Appropriateness  measurement  involves  a  two-stage  process:  a  test  norming 
or  item  parameter  estimation  stage,  followed  by  a  person  measurement  or  index 
computation  stage.  This  distinction  parallels  the  separation  of  parameters  in 
latent  trait  models  into  (1)  item  or  test  characterizing  parameters,  such  as 
item  difficulties;  and  (2)  individual  difference  or  person  characterizing  param¬ 
eters,  such  as  abilities. 

The  Standard  Model 


In the studies to be reported here, the test norming stage is developed
around what will be called the standard model. This test model is a version of
the 3-parameter logistic model of item response theory (Birnbaum, 1968). According
to the standard model, an answer sheet is generated by a two-stage experiment.
In the first stage an ability, θ, is sampled. The second stage is a
sequence of independent binary random variables. These are the item scores,
coded from the observed answer sheet with "1" denoting a correct response and
"0" an incorrect response. (The ability θ is not observed and the distribution
of abilities is neither specified nor estimated in these studies.)


After some notation is introduced, the essential features of the standard
model can be summarized with two equations. Let U^(j) denote the jth examinee's
answer pattern. Thus, U^(j), j = 1, 2, ..., N, is the vector of item scores,

    U^{(j)} = \langle u_1^{(j)}, u_2^{(j)}, \ldots, u_n^{(j)} \rangle

for a test composed of n items. Let P_i(θ) denote the regression of the ith item
score on ability, i.e., P_i(θ) = E(u_i | θ). For a given θ, P_i(θ) can be interpreted
as the conditional probability of an examinee, randomly selected from all examinees
with ability θ, correctly answering the ith item. With this notation, the
conceptualization of the second stage of the answer sheet generation process can
be expressed by the following equation, which is known as the "local independence"
assumption of latent trait theory:

    \mathrm{Prob}\{U^{(j)} \mid \theta\} = \prod_{i=1}^{n} P_i(\theta)^{u_i} \, [1 - P_i(\theta)]^{1 - u_i}    [1]

In  words,  Equation  1  states  that  item  responses  are  conditionally  independent. 
The  independence  of  different  examinees  implicit  in  the  first  stage  of  the  an¬ 
swer  sheet  generation  process  is  expressed  in  Equation  2: 


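    \mathrm{Prob}\{U^{(1)}, U^{(2)}, \ldots, U^{(N)} \mid \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}\}
        = \prod_{j=1}^{N} \mathrm{Prob}\{U^{(j)} \mid \theta^{(j)}\} .    [2]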
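Equations 1 and 2 translate directly into sums of logarithms. The Python sketch below is illustrative only: the function names are assumptions, and each examinee's item probabilities P_i(θ) are simply supplied as numbers rather than derived from a particular item response function.

```python
import math

def pattern_log_likelihood(u, p):
    """Log of Equation 1: items are conditionally independent, so the log
    likelihood is the sum of log P_i for correct items and log(1 - P_i)
    for incorrect items."""
    return sum(math.log(p_i) if u_i == 1 else math.log(1.0 - p_i)
               for u_i, p_i in zip(u, p))

def sample_log_likelihood(patterns, probs_per_examinee):
    """Log of Equation 2: examinees are independent, so their log
    likelihoods add."""
    return sum(pattern_log_likelihood(u, p)
               for u, p in zip(patterns, probs_per_examinee))

# Two hypothetical examinees on a two-item test.
patterns = [[1, 0], [1, 1]]
probs = [[0.8, 0.5], [0.9, 0.7]]
print(pattern_log_likelihood(patterns[0], probs[0]))
print(sample_log_likelihood(patterns, probs))
```

Working on the log scale avoids numerical underflow when the products in Equations 1 and 2 run over many items and examinees.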


To facilitate the test norming stage, each P_i is assumed to have the functional
form

    P_i(\theta) = c_i + (1 - c_i) \{1 + \exp[-a_i(\theta - b_i)]\}^{-1}    [3]

for some positive a_i, real b_i, and with 0 ≤ c_i ≤ 1. This asserts that P_i is
S-shaped with a lower asymptote of c_i and an upper asymptote of unity. The location
and scale parameters b_i and a_i express differences between items in difficulty
(b_i) and ability to discriminate between low- and high-ability examinees
(a_i). This particular functional form is conventional, is widely used, and has
been supported in nonparametric studies (Levine & Saxe, 1976).

"Test norming" consists of estimating the numerical values for a_i, b_i, and
c_i. This is done by selecting a large sample of N presumably normal examinees;
observing their answer patterns U^(1), U^(2), ..., U^(N); and finding a set of a's,
b's, c's, and θ's that maximizes the likelihood function in Equation 2. The
interest in the test norming stage is in the item parameters a_i, b_i, and c_i; however,
the θ's must also be estimated in the current procedure.
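The functional form in Equation 3 is short enough to state in code. The Python sketch below uses illustrative parameter values (a five-choice item with a lower asymptote of .2); the function name is an assumption, and no scaling constant is applied to the exponent because none appears in Equation 3.

```python
import math

def p_3pl(theta, a, b, c):
    """Equation 3: probability of a correct response for ability theta,
    discrimination a > 0, difficulty b, and lower asymptote 0 <= c <= 1."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Illustrative item: a = 1.2, b = 0.5, c = 0.2.
for theta in (-3.0, 0.5, 3.0):
    print(theta, round(p_3pl(theta, 1.2, 0.5, 0.2), 3))
```

At θ = b the curve sits halfway between the lower asymptote c and unity, which gives a quick sanity check on any implementation.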

Lord's  LOGIST  algorithm  (Wood  &  Lord,  1976;  Wood,  Wingersky,  &  Lord,  1976) 
can  be  used  to  maximize  Equation  2.  This  program  has  been  vigorously  criticized 
by  Wright  and  his  associates  (e.g.,  Wright,  1977).  In  fact,  Wright  has  ques¬ 
tioned  whether  any  algorithm  can  be  designed  to  estimate  the  parameters  of  the 
standard  model;  the  relevance  of  the  results  of  the  present  studies  to  these 
criticisms  is  summarized  in  the  last  section  of  this  paper. 

The test norming stage can be thought of as specifying a continuum of models
or statistical characterizations of typical examinees by using test responses
of the first N nominally normal examinees in the first stage to estimate
a set of item parameters. These parameter estimates can be substituted in Equation
1 to obtain an explicit formula for the likelihood of a new pattern of answers,
say U^(N+1). An intuition that has guided much of the authors' current
research, and which will now be used to introduce the person measurement stage,
is this: Suppose for all values of θ, U^(N+1) appears improbable in the sense
that Prob(U^(N+1) | θ) is very small. Then U^(N+1) is badly fit by all models for
individual data developed in the test norming stage.

In  the  person  measurement  stage  the  item  parameters  are  treated  as  known 
and  an  index  of  goodness  of  fit  is  computed  for  each  person's  answer  pattern. 

The simplest index, L_0, is

    L_0 = \log \max_{\theta} \mathrm{Prob}(U^{(N+1)} \mid \theta) .    [4]

This index will be small if Prob(U^(N+1) | θ) is small for all values of θ. A small
value of L_0 could result if many incorrectly answered easy items rule out high
values of θ and many correctly answered difficult items rule out low values of
θ. L_0 detects aberrance surprisingly well. To improve upon it, a model for
aberrant data and two measures for the degree of aberrance will be specified in
the next section.
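A minimal Python sketch of the L_0 index follows. It assumes the 3-parameter logistic form of Equation 3 with item parameters treated as known, and it replaces a proper maximization with a simple grid search over θ; the item parameters, the grid limits, and the two answer patterns are hypothetical.

```python
import math

def p_3pl(theta, a, b, c):
    """Equation 3 item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_lik(u, items, theta):
    """Log of Equation 1 for answer pattern u at ability theta."""
    total = 0.0
    for u_i, (a, b, c) in zip(u, items):
        p = p_3pl(theta, a, b, c)
        total += math.log(p) if u_i == 1 else math.log(1.0 - p)
    return total

def l0_index(u, items, lo=-4.0, hi=4.0, steps=161):
    """Equation 4, approximated by maximizing over an evenly spaced theta grid."""
    grid = [lo + k * (hi - lo) / (steps - 1) for k in range(steps)]
    return max(log_lik(u, items, theta) for theta in grid)

# Five hypothetical items ordered from easy to hard, each with c = .2.
items = [(1.0, b, 0.2) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
typical = [1, 1, 1, 0, 0]    # easy items right, hard items wrong
aberrant = [0, 0, 0, 1, 1]   # easy items wrong, hard items right

l0_typical = l0_index(typical, items)
l0_aberrant = l0_index(aberrant, items)
print(l0_typical, l0_aberrant)
```

The reversed pattern receives the smaller index because no single value of θ makes both the missed easy items and the solved hard items probable, which is the mechanism described in the paragraph above.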

Variable  Ability  Models  and  Appropriateness  Indices 

L_0, like χ², is sensitive to any type of poorness of fit for the standard
model. A generalization of the standard model is needed to detect aberrations
of the specific kind that is of interest.

In  many  of  the  most  important  types  of  test  inappropriateness,  the  aberrant 
examinee  behaves  as  if  his  or  her  ability  were  fluctuating  from  item  to  item. 
Thus,  the  low-ability  cheater  appears  to  have  a  much  higher  ability  for  those 
items  on  which  he  or  she  has  been  coached.  The  high-ability,  low-English- 
fluency  candidate  behaves  as  if  he  or  she  had  low  ability  on  linguistically  de¬ 
manding  items. 

In the standard model the examinee's ability, $\theta$, is constant across
items. In variable ability models, introduced by Levine and Rubin (1979), the
examinee's ability is conceptualized as varying from item to item. For example,
in the Gaussian model, each examinee is characterized by a pair of parameters:
central ability, $\theta_0$, and ability variance, $\sigma^2$. (Note that the
standard model utilizes a single individual difference parameter, $\theta$.)
According to the Gaussian model, the examinee has ability $\theta_1$ on Item 1,
$\theta_2$ on Item 2, ..., and $\theta_n$ on Item n. The model asserts that for
given $\theta_0$ and $\sigma^2$, the $\theta_i$ are independent identically
distributed normal or Gaussian random variables with mean = $\theta_0$ and
variance = $\sigma^2$. To specify the model more precisely the likelihood
function $\mathrm{Prob}(U \mid \theta_0, \sigma^2)$, which gives the conditional
probability of a vector U of item responses, is written and simplified as
follows:

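$$
\begin{aligned}
\mathrm{Prob}(U \mid \theta_0, \sigma^2)
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{i=1}^{n}
   \left\{ P_i(\theta_i)^{u_i} \, [1 - P_i(\theta_i)]^{1-u_i} \,
   \phi(\theta_i; \theta_0, \sigma^2) \, d\theta_i \right\} \\
&= \prod_{i=1}^{n} \int_{-\infty}^{\infty} P_i(t)^{u_i} \, [1 - P_i(t)]^{1-u_i} \,
   \phi(t; \theta_0, \sigma^2) \, dt ,
\end{aligned}
\tag{5}
$$

where $\phi(t; \theta_0, \sigma^2)$ is the Gaussian density

$$
\phi(t; \theta_0, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}
\exp\left[ -\frac{1}{2} \left( \frac{t - \theta_0}{\sigma} \right)^{2} \right].
\tag{6}
$$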


Equation  5  is  analogous  to  Equation  1  for  the  standard  model. 

To obtain a second appropriateness index testing for a specific departure
from the standard model, the maximum likelihood ratio test statistic is
computed:

$$
\mathit{LR} = L_n - L_0 , \tag{7}
$$

where

$$
L_n = \log \max_{\theta_0, \sigma} \mathrm{Prob}(U \mid \theta_0, \sigma^2)
$$

and

$$
L_0 = \log \max_{\theta} \mathrm{Prob}(U \mid \theta),
$$

as before. LR measures the degree to which a variable ability model provides a
better fit to the observed pattern of responses than the standard model.

The final appropriateness index or measure of goodness of fit is
$\hat{\sigma}$, the maximum likelihood estimate of the ability standard
deviation. This index is obtained by maximizing
$\mathrm{Prob}(U \mid \theta_0, \sigma^2)$ with respect to both $\theta_0$ and
$\sigma$. The standard model is a special case of the Gaussian model in the
sense that $\mathrm{Prob}(U \mid \theta)$ is the limit of
$\mathrm{Prob}(U \mid \theta_0, \sigma^2)$ as $\sigma$ decreases to zero.
Consequently, a small $\hat{\sigma}$ can be interpreted as indicating a small
degree of aberrance.
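
The Gaussian-model likelihood and the two indices derived from it can be sketched numerically as follows. This is only an illustration under stated assumptions: Equation 3 is taken to be the 3-parameter logistic function with D = 1.7, the one-dimensional integrals are approximated by Gauss-Hermite quadrature, and the maximization uses a coarse grid search rather than the Fletcher-Powell algorithm used by the authors. All names and item parameters are hypothetical.

```python
import numpy as np

D = 1.7  # assumed logistic scaling constant


def p3pl(t, a, b, c):
    """Assumed 3-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (t - b)))


# Gauss-Hermite nodes and weights for integrals of f(x) * exp(-x**2).
NODES, WEIGHTS = np.polynomial.hermite.hermgauss(41)


def loglik_gaussian(u, theta0, sigma, a, b, c):
    """Approximate log of the Gaussian-model likelihood for a 0/1 vector u."""
    # The substitution t = theta0 + sqrt(2) * sigma * x turns each integral
    # against the normal density into a weighted sum over the nodes.
    t = theta0 + np.sqrt(2.0) * sigma * NODES[:, None]  # nodes x items
    p_bar = (WEIGHTS[:, None] * p3pl(t, a, b, c)).sum(axis=0) / np.sqrt(np.pi)
    # For binary u_i the integrand is P_i(t) when u_i = 1, else 1 - P_i(t).
    return float(np.sum(np.where(u == 1, np.log(p_bar), np.log1p(-p_bar))))


def indices(u, a, b, c,
            thetas=np.linspace(-4.0, 4.0, 81),
            sigmas=np.linspace(0.0, 3.0, 31)):
    """Return (L0, LR, sigma_hat) by exhaustive grid search."""
    # With sigma = 0 the Gaussian model reduces to the standard model.
    l0 = max(loglik_gaussian(u, th, 0.0, a, b, c) for th in thetas)
    ln, sigma_hat = max(
        (loglik_gaussian(u, th, s, a, b, c), s) for s in sigmas for th in thetas
    )
    return l0, ln - l0, float(sigma_hat)


# Hypothetical 20-item test.
b = np.linspace(-2.0, 2.0, 20)
a = np.ones(20)
c = np.full(20, 0.2)

u_ok = (b < 0).astype(int)       # consistent pattern
u_low = np.ones(20, dtype=int)   # able examinee ...
u_low[:4] = 0                    # ... who misses the four easiest items

l0_ok, lr_ok, s_ok = indices(u_ok, a, b, c)
l0_low, lr_low, s_low = indices(u_low, a, b, c)
print(lr_ok, s_ok, lr_low, s_low)
```

Because the grid for $\sigma$ includes zero, LR cannot be negative in this sketch; a pattern that the standard model already fits well yields LR near zero and a small $\hat{\sigma}$.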

Index  Evaluation  and  Receiver  Operator  Curves 

Each  application  of  appropriateness  measurement  will  have  different  rewards 
for  correctly  identifying  an  inappropriate  score  and  different  penalties  for 
incorrectly  classifying  a  nonaberrant  examinee  as  aberrant.  The  receiver  opera¬ 
tor  curve  (ROC)  of  statistical  decision  theory  provides  a  graphic  way  to  compare 
indices  prior  to  the  specification  of  rewards  and  penalties. 


In applying an appropriateness index to classify examinees, a cutoff or
criterion value of the index is specified. For the present exposition, assume
that small index values indicate aberrance. At each criterion value, t, the
proportion of aberrant examinees correctly identified as aberrant and the
proportion of normal examinees improperly identified as aberrant can be denoted
as

x(t) = proportion of normal examinees with index values < t
y(t) = proportion of aberrant examinees with index values < t.

An ROC results from plotting the <x(t), y(t)> pairs obtained for various
values of t. A desirable ROC is one that rises sharply from the origin toward
the upper left-hand corner of the plot. In contrast, an ROC that lies along the
diagonal of the plot indicates a random classification rule. That is,
classifying examinees on the basis of flipping a coin yields a diagonal ROC.
Clearly,
an  ROC  that  indicates  an  effective  detection  procedure  is  one  that  lies  well 
above  the  diagonal. 
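
The definitions of x(t) and y(t) translate directly into a few lines of Python. The sketch below is illustrative only; the function name and the index values are hypothetical.

```python
import numpy as np


def roc_points(normal_index, aberrant_index):
    """Return (cutoffs, x, y), where x and y are the proportions of normal
    and aberrant examinees with index values strictly below each cutoff."""
    normal_index = np.sort(np.asarray(normal_index, dtype=float))
    aberrant_index = np.sort(np.asarray(aberrant_index, dtype=float))
    # Use every observed value as a cutoff, plus +inf to reach the point (1, 1).
    cutoffs = np.unique(np.concatenate([normal_index, aberrant_index, [np.inf]]))
    # searchsorted with side="left" counts the values strictly below a cutoff.
    x = np.searchsorted(normal_index, cutoffs, side="left") / normal_index.size
    y = np.searchsorted(aberrant_index, cutoffs, side="left") / aberrant_index.size
    return cutoffs, x, y


# Hypothetical index values; small values indicate aberrance.
normal = [-20.0, -22.0, -25.0, -30.0]
aberrant = [-28.0, -35.0, -40.0]

cutoffs, x, y = roc_points(normal, aberrant)
# Largest hit rate attainable without misclassifying any normal examinee.
hit_rate_at_zero_false_alarms = y[x == 0].max()
print(hit_rate_at_zero_false_alarms)
```

Plotting y against x gives the ROC; the last line computes the kind of "hit rate at zero false alarms" figure reported in the studies that follow.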

Simulation  Procedures  and  Results 


Levine and Rubin's (in press) simulation methods and results are reviewed
here, as some of their data are reanalyzed, and their methods are applied to
new data. To simulate a normal vector of item scores, they first sampled an
ability $\theta$ from a normally distributed population with zero mean and unit
variance. The examinee's first item score was generated by sampling a number
uniformly distributed in the unit interval. If the sampled number was less than
or equal to $P_1(\theta)$ from Equation 3, then the first item was scored as
correct; otherwise, the item was scored as incorrect. The remaining item scores
were obtained by independently drawing new uniformly distributed numbers and
comparing them with $P_i(\theta)$ for i = 2, 3, ..., n.

The parameters $a_i$, $b_i$, and $c_i$ utilized in Equation 3 were those
obtained by Lord's fitting of a 3-parameter logistic model to a large sample of
Scholastic Aptitude Test, Verbal Section data (SAT-V; Lord, 1968). The actual
simulation was implemented with Hambleton and Rovinelli's (1973) program. (For
technical details concerning the random number generators used, see Levine &
Rubin, in press, Appendix.)
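
The sampling scheme just described can be sketched in Python as follows. This is not the Hambleton and Rovinelli program: the random number generator is NumPy's default, Equation 3 is assumed to be the 3-parameter logistic function with D = 1.7, and the item parameters are drawn arbitrarily for illustration rather than taken from Lord (1968).

```python
import numpy as np

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """Assumed 3-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))


def simulate_normal(n_examinees, a, b, c, rng):
    """Simulate 0/1 item scores for nominally normal examinees."""
    theta = rng.standard_normal(n_examinees)  # abilities ~ N(0, 1)
    p = p3pl(theta[:, None], a, b, c)         # examinees x items
    draws = rng.random(p.shape)               # uniform numbers in [0, 1)
    # An item is scored correct when its uniform draw is <= P_i(theta).
    return (draws <= p).astype(int), theta


rng = np.random.default_rng(0)
n_items = 85
a = rng.uniform(0.5, 1.5, n_items)  # hypothetical discriminations
b = rng.normal(0.0, 1.0, n_items)   # hypothetical difficulties
c = np.full(n_items, 0.2)           # hypothetical guessing parameters

u, theta = simulate_normal(500, a, b, c, rng)
print(u.shape, u.mean())
```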

Aberrant  examinees  were  simulated  by  modifying  simulated  normal  answer 
sheets  in  various  ways.  In  this  paper  concern  is  primarily  with  the  "20%  spuri¬ 
ously  low"  modification,  but  other  modifications  will  also  be  reviewed  briefly. 

To create a spuriously high answer sheet, 20% of a normal simulated
examinee's item responses were randomly sampled without replacement. If the
sampled item was originally scored as correct, it was left unchanged. If the
sampled item was not correct, it was rescored as correct.

To  create  a  spuriously  low  answer  sheet,  20%  of  a  normal  answer  sheet's 
responses  were  sampled  as  before.  Then,  a  random  number  generator  was  used  to 
simulate random guessing over the five multiple-choice alternatives. No matter
what  the  original  sampled  item  score  was,  the  sampled  item  was  scored  correct 
with  probability  .20  and  scored  as  incorrect  with  probability  .80. 
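
Both modifications are simple to express in code. The sketch below is illustrative; the function names are hypothetical, and `fraction_changed` anticipates the screening rule (at least 10% of item scores changed) that Levine and Rubin applied to the spuriously low sample.

```python
import numpy as np


def spuriously_high(u, frac, rng):
    """Rescore a randomly sampled fraction of the items as correct."""
    u = u.copy()
    k = int(round(frac * u.size))
    idx = rng.choice(u.size, size=k, replace=False)  # without replacement
    u[idx] = 1
    return u


def spuriously_low(u, frac, rng, n_alternatives=5):
    """Replace a randomly sampled fraction of the item scores with blind
    guesses: correct with probability 1 / n_alternatives."""
    u = u.copy()
    k = int(round(frac * u.size))
    idx = rng.choice(u.size, size=k, replace=False)
    u[idx] = (rng.random(k) < 1.0 / n_alternatives).astype(u.dtype)
    return u


def fraction_changed(original, modified):
    """Proportion of item scores that differ after the modification."""
    return float(np.mean(original != modified))


rng = np.random.default_rng(1)
u = (rng.random(85) < 0.6).astype(int)  # hypothetical normal answer sheet

u_high = spuriously_high(u, 0.20, rng)
u_low = spuriously_low(u, 0.20, rng)
keep_for_analysis = fraction_changed(u, u_low) >= 0.10
print(fraction_changed(u, u_high), fraction_changed(u, u_low), keep_for_analysis)
```

Note that a sampled item may keep its original score under either modification, so the fraction of scores actually changed never exceeds, and is usually below, the fraction of items sampled.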


Levine and Rubin observed that the modification of a normal answer pattern
to  create  a  spuriously  low  answer  pattern  frequently  resulted  in  little  or  no 
change  in  the  actual  response  vector.  Clearly,  if  an  aberrance-producing  pro¬ 
cess  does  not  alter  the  objective  response  pattern,  it  is  futile  to  attempt  to 
detect  the  presence  of  the  process  with  an  appropriateness  index.  Furthermore, 
if  there  is  no  effect  on  the  objective  response  pattern  and  test  score,  there  is 
little  motivation  for  detecting  aberrance.  Consequently,  Levine  and  Rubin  sepa¬ 
rately  analyzed  the  data  from  spuriously  low  examinees  who  had  at  least  10%  of 
their  scores  changed.  Both  correct  responses  changed  to  incorrect  and  incorrect 
responses  changed  to  correct  were  counted  toward  the  10%  figure. 

Levine  and  Rubin  observed  good  aberrance  detection  with  the  total  sample  of 
spuriously  low  examinees  and  excellent  detection  for  both  the  spuriously  high 
examinees  and  the  selected  large-score-change  sample  of  spuriously  low  exam¬ 
inees.  These  generalizations  held  for  all  three  indices. 

There  are  two  additional  results  obtained  by  Levine  and  Rubin  that  merit 
attention.  First,  Levine  and  Rubin  systematically  varied  the  percentage  of 
items  modified,  utilizing  4%,  10%,  20%,  and  40%  treatments.  As  expected,  they 
found increasing detectability as the percentage of modified items was
increased. A second finding by Levine and Rubin was that the three
appropriateness indices (L0, LR, and $\hat{\sigma}$) yield quite similar
patterns of detectability.
Interestingly,  no  one  index  was  substantially  better  or  worse  than  the  other 
two.  Consequently,  Levine  and  Rubin  did  not  recommend  using  any  one  particular 
index  in  future  research. 

For  the  present  study  Levine  and  Rubin's  data  were  reanalyzed  using  esti¬ 
mated  item  parameters  instead  of  simulation  parameters,  and  appropriateness  mea¬ 
surement  techniques  were  applied  to  actual  test  data.  Levine  and  Rubin  data 
files  relevant  to  the  present  research  were  as  follows: 


1. NORMAL 3200: 3,200 simulated answer sheets with normally distributed
   abilities; item parameters from Lord's (1968) fitting of the SAT-V;
   items scored either as correct or incorrect.

2. NORMAL 2800: The first 2,800 records from NORMAL 3200.

3. LOW 200: Records 3,001, 3,002, ..., 3,200 from NORMAL 3200 modified to
   simulate ability-unrelated responding on 20% of the test according to
   the 20% spuriously low modification.

4. LOW 102: 102 records selected from LOW 200, having at least 10% of
   the simulated examinee's original responses changed in the spuriously
   low modification.

5. HIGH 200: Records 2,801, 2,802, ..., 3,000 from NORMAL 3200 modified to
   simulate cheating on 20% of the items according to the 20% spuriously
   high modification.

Study 1: Estimated Parameters

Problem 

Levine  and  Rubin  bypassed  the  first  (test  norming)  stage  of  appropriateness 
measurement.  Instead  of  estimating  item  parameters  from  a  large  sample  of  nor¬ 
mal  examinees,  they  used  the  exact  (simulation)  parameters  to  compute  appropri¬ 
ateness  indices.  Clearly,  in  an  application  with  actual  data,  item  parameters 
must  be  estimated.  How  sensitive  are  indices  to  item  parameter  estimation  er¬ 
ror?  Can  high  detection  rates  be  achieved  with  estimated  parameters? 

Method 

Item  parameters  were  estimated  by  applying  Lord's  maximum  likelihood  algo¬ 
rithm  LOGIST  to  NORMAL  2800  data.  L0  was  computed  for  each  NORMAL  2800  examinee 
by  evaluating  the  likelihood  function  at  the  LOGIST  maximum  likelihood  estimate 
of  ability.  The  LOGIST-estimated  item  parameters  were  then  used  to  compute  L0 
for  the  LOW  102  response  vectors  by  rerunning  LOGIST  for  the  LOW  102  data  with 
all  item  parameters  fixed  at  the  values  obtained  from  NORMAL  2800.  In  this  way 
an  estimated  parameter  L0  appropriateness  index  value  was  obtained  for  each  NOR¬ 
MAL  2800  and  LOW  102  response  vector. 

Results 

A  close  agreement  between  values  of  L0  computed  from  exact  parameter  and 
estimated  parameter  index  values  was  observed.  In  Figure  1  a  bivariate  scatter- 
plot  demonstrates  this  agreement  for  the  critical  LOW  102  sample.  Each  of  the 
102 simulated examinees contributes a point to the scatterplot. The
x-coordinate of the point is L0 computed with exact item parameters; the
y-coordinate,
with  estimated  item  parameters.  If  there  were  perfect  agreement  between  the  two 
measures,  the  points  would  fall  on  the  diagonal  line,  which  has  been  drawn  on 
the  figure.  A  very  slight  tendency  is  observed  for  estimated  index  values  to  be 
smaller  than  exact  index  values  for  the  most  aberrant  examinees  in  LOW  102  (L0 
less  than  -60). 

The  same  close  agreement  was  observed  for  the  NORMAL  2800  sample.  This  is 
shown  in  Figure  2  for  the  first  100  simulated  examinees  in  NORMAL  2800. 

A  more  significant  result  is  shown  in  the  estimated  parameter  L0  ROC  presen¬ 
ted  in  Figure  3.  Note  that  detection  rates  were  high,  even  for  low  false  alarm 
rates.  For  example,  12.7%  of  the  aberrant  examinees  could  be  correctly  classi¬ 
fied  with  an  L0  cutoff  score  that  did  not  misclassify  a  single  NORMAL  2800  exam¬ 
inee.  Further,  21.6%  of  the  aberrant  examinees  were  detected  at  a  false  alarm 
rate  of  .9%  and  47.1%  of  the  aberrants  were  detected  at  a  4.4%  false  alarm  rate. 
Figure  3  is  in  close  agreement  with  Levine  and  Rubin's  exact  parameter  LR  ROC 
computed  from  the  same  data  (Levine  &  Rubin,  in  press,  Figure  8).  In  fact, 
there  is  a  very  small  superiority  for  the  estimated  parameter  ROC;  using  esti¬ 
mated  parameters  there  were  0%,  .9%,  and  4.4%  false  alarms  at  hit  rates  of 
11.8%,  21.6%,  and  47.1%  in  comparison  to  .1%,  .9%,  and  4.8%  false  alarms  at  the 
corresponding  points  in  the  exact  parameter  ROC. 


Figure 1

Bivariate Scatterplot of L0 Computed from Exact and
Estimated (from NORMAL 2800 Data) Item Parameters
for LOW 102 Response Vectors

(Horizontal axis: L0 Computed from Exact Item Parameters)

Discussion

The results may initially seem surprising in view of Lord's (1975a) study of
the disparity between LOGIST-estimated item parameters and simulation
parameters.² Lord found much more variation around the diagonal in his
bivariate plots of simulation item parameters versus estimated item parameters
than appears in the plots of estimated parameter L0 versus simulation parameter
L0.

The  discrepancy  becomes  less  surprising  when  it  is  recalled  that  L0  is  the  max¬ 
imum  value  of  a  likelihood  function,  whereas  a  LOGIST  parameter  estimate  gives 
the  location  of  a  point  at  which  the  maximum  is  obtained.  If  the  likelihood 
function  is  relatively  flat,  then  the  maximizing  values  of  the  arguments  of  the 
function  will  be  difficult  to  determine  precisely  because  a  small  parameter 
change  results  in  a  very  small  likelihood  change.  Somewhat  paradoxically,  flat¬ 
ness  of  the  likelihood  function  can  simplify  the  problem  of  calculating  L0,  the 
value  of  the  function  near  its  maximum.  The  value  of  the  likelihood  function 
will  be  almost  constant  for  parameter  values  in  the  neighborhood  of  the  maximiz¬ 
ing  value. 

² LOGIST version 2.B was used in the present research. LOGIST has since been
modified.

Figure 2

Bivariate Scatterplot of L0 Computed from Exact and
Estimated (from NORMAL 2800 Data) Item Parameters
for the First 100 Simulated Examinees in NORMAL 2800 Data

(Horizontal axis: L0 Computed from Exact Item Parameters)


A  possible  artifact  complicating  the  interpretation  of  these  results  is 
evidenced  by  consideration  of  an  analogy  from  multiple  regression.  The  ROC  will 
be  high  to  the  degree  that  the  estimated  parameters  fit  NORMAL  2800  better  than 
LOW  102.  NORMAL  2800  can  be  considered  analogous  to  a  multiple  regression  de¬ 
rivation  sample  and  LOW  102  to  a  cross-validation  sample.  At  least  with  small 
samples,  overfitting  is  expected  in  the  derivation  sample;  and  shrinkage,  in  the 
cross-validation  sample.  It  may  be  for  this  reason  that  L0 ,  as  a  measure  of 
goodness  of  fit,  tends  to  be  smaller  in  LOW  102.  That  estimated  parameter  aber¬ 
rance  scores  for  the  most  aberrant  LOW  102  examinees  are  lower  than  exact  param¬ 
eter  aberrance  scores  supports  the  suspicion  of  overfitting.  However,  any  over¬ 
fitting  should  result  in  relatively  high  NORMAL  2800  scores  (i.e.,  many  points 
above  the  diagonal  in  Figure  2)  and  this  was  not  found.  Furthermore,  the  dis¬ 
crepancy  between  estimated  and  exact  parameter  index  values  is  so  small  that 
abscissa values on the ROC would be changed less than .0004 if the lowest LOW 102
index  values  fell  exactly  on  the  diagonal  in  Figure  1. 

If overfitting were a significant artifact, then poor detection would be
expected (1) if the normal group used to evaluate an index was distinct from the
norming group utilized for item parameter estimation or (2) if the aberrant and
normal groups were pooled to form the norming sample. In Study 4 and in Drasgow
(1978) the norming and normal groups were distinct. In the next study the
aberrant and normal groups were combined to form a single group used to estimate
item parameters.

Figure 3

ROC Describing Detection of LOW 102 Response Vectors
by L0 Index Computed from Item Parameters Estimated in NORMAL 2800 Data

(Horizontal axis: Proportion of NORMAL 2800 Response Vectors Misclassified)

Study  2:  Heterogeneous  Norming  Sample — 

Classification  and  Norming  Sample  Equal 

Problem 


In  a  simulation  study  all  aberrant  answer  sheets  are  clearly  identified, 
since  they  are  generated  by  the  experimenter.  In  an  actual  study  some  unde¬ 
tected  aberrants  are  likely  to  be  included  in  the  test  norming  sample.  Will 
unsuspected  aberrants  in  the  norming  sample  seriously  degrade  item  parameter 
estimates  and  undermine  the  person  measurement  stage  of  appropriateness  measure¬ 
ment? 

A secondary problem is related to the possible overfitting and shrinkage
noted  in  Study  1.  Individual  simulation  examinees  are  expendable.  They  can  be 
used  for  test  norming  and  ignored  thereafter  because  new  statistically  equiva¬ 
lent  answer  sheets  can  easily  be  generated  for  use  in  the  person  measurement 
stage.  However,  in  actual  studies  sample  sizes  are  fixed,  and  it  generally  will 
be  important  to  evaluate  appropriateness  for  the  examinees  in  the  norming  sam¬ 
ple.  Will  current  estimation  procedures  overfit  aberrant  examinees  in  the  norm¬ 
ing  sample,  or  will  it  be  possible  to  use  an  appropriateness  index  to  identify 
norming  sample  aberrants? 

Method 


NORMAL  2800  and  LOW  200  were  merged  to  form  a  data  file  with  a  large  pro¬ 
portion  of  aberrant  examinees.  Item  parameters  were  estimated  using  all  3,000 
simulated examinees. As before, L0 was computed for all examinees by evaluating
the likelihood function at the LOGIST maximum likelihood estimate of ability.
New index values for LOW 102 were compared with exact parameter index values,
and the L0 ROC was computed to evaluate detectability.

Results  and  Discussion 


Figure  4  shows  that  estimating  item  parameters  in  a  large  sample  with  a 
large  proportion  of  spuriously  low  examinees  need  not  noticeably  degrade  appro¬ 
priateness  measurement.  The  simulation  parameter  L0  ROC  had  hit  rates  of  11.7%, 
21.6%,  and  47.1%  at  false  alarm  rates  of  .1%,  .9%,  and  4.8%.  The  corresponding 
heterogeneous  norming  sample  false  alarm  rates  were  .1%,  1.2%,  and  4.9%.  Clear¬ 
ly,  the  net  effect  on  appropriateness  measurement  of  estimating  item  parameters 
from  this  heterogeneous  sample  is  negligible. 

Figure  5  contains  the  bivariate  scatterplot  of  exact  parameter  L0  values 
plotted  against  L0  values  computed  from  item  parameters  estimated  in  the  hetero¬ 
geneous  sample.  The  relatively  high  frequency  of  points  above  the  diagonal  in 
Figure  5  indicates  an  overfitting  effect  for  aberrant  examinees;  however,  both 
Figures  4  and  5  support  the  conclusion  that  the  effect  is  small.  Figure  5  does 
so  because  all  points  are  tightly  clustered  about  the  diagonal,  i.e.,  there  is 
little difference between exact parameter L0 values and heterogeneous sample
estimated  parameter  L0  values.  The  high  detectability  exhibited  in  the  ROC  sup¬ 
ports  the  contention  that  overfitting  is  small,  because  a  large  overfitting  ef¬ 
fect  would  have  reduced  normal-aberrant  group  differences  and  therefore  reduced 
detectability  of  aberrance. 

These  results  on  overfitting  should  be  interpreted  cautiously.  Parameters 
were  estimated  in  a  very  large  sample  (N  =  3,000).  Further,  the  nature  of  spu¬ 
riously  low  aberrance  may  be  essential  to  the  small  effect.  The  distinction 
between  bias  and  sampling  error  is  useful  in  understanding  this  point.  The  pro¬ 
cess  hypothesized  to  underlie  spuriously  low  aberrance  is  essentially  unsystem¬ 
atic;  that  is,  atypical  schooling,  alignment  errors,  and  exceptional  creativity 
lead  to  incorrect  responses  on  different  examination  items.  Thus,  the  different 
examinees will tend to have competing effects on item parameter estimates.
Consequently, the presence of aberrance in the norming group should affect the
sampling error of item parameter estimates to some extent but should have a
relatively small effect on the bias of the estimates. In addition, the effect on
the sampling error is tolerable due to the large sample size.

Figure 4

ROC Describing Detection of LOW 102 Response Vectors
by the L0 Index Computed from Item Parameters Estimated in
NORMAL 2800 and LOW 200 Data


Study  3:  Actual  SAT-V  Data — 


Problem 

The  one-dimensional  3-parameter  logistic  model  is  the  most  general  model 
for  which  there  is  a  well-developed  parameter  estimation  literature.  It  is  easy 
to  formulate  more  plausible  refinements  and  generalizations  of  this  model.  How¬ 
ever,  their  use  would  require  a  long  and  costly  research  program  to  develop  and 
validate  parameter  estimation  methods.  Is  the  logistic  model  sufficiently  de¬ 
scriptive  to  detect  spuriously  low  examinees  in  actual  test  data? 

Method 

Three thousand "low omitting rate" examinees were sampled from an
administration of the SAT-V. All 3,000 examinees responded to at least 90% of
the test items. LOGIST was used to estimate item parameters from these 3,000
nominally normal examinees. A file of 200 aberrant examinees was then created by
applying the 20% spuriously low modification to answer sheets from examinees
2,801, 2,802, ..., 3,000.

Figure 5

Bivariate Scatterplot of L0 Computed from Exact and
Estimated (from NORMAL 2800 and LOW 200 Data) Item Parameters
for LOW 102 Response Vectors

(Horizontal axis: L0 Computed from Exact Item Parameters)

L0 was computed by maximizing the individual likelihood functions for the
3-parameter logistic model. In this calculation the LOGIST-estimated item
parameters were held constant and the likelihood of the individual's response
vector was maximized by selecting an optimal ability estimate. The ability
estimates are slightly different from the LOGIST ability estimates because the
programs used in this study ignored both omitted and not reached items. (The
LOGIST procedure ignores not reached items but attempts to use the omitted items
by a technique that can be thought of as giving partial credit for omitted
items.)

The person parameters of the Gaussian model were estimated by a version of
the Fletcher-Powell algorithm (Gruvaeus & Joreskog, 1970). In this calculation
each examinee's response vector is considered in isolation, and the likelihood
function is maximized to estimate central ability, $\theta_0$, and ability
variance, $\sigma^2$. As before, omitted and not reached items are ignored.

Results 

Figures  6  and  7  present  the  ROCs  for  the  L0  and  LR  appropriateness  indices, 
respectively.  It  is  apparent  that  high  detection  rates  are  obtained  at  low 
false alarm rates. In particular, there are .2%, .8%, and 3.3% false alarms at
hit  rates  of  10%,  20%,  and  40%  for  the  L0  index;  and  the  LR  index  yields  .04%, 
.8%,  and  4.3%  false  alarms  at  the  same  hit  rates.  These  results  are  even  more 
impressive  when  it  is  noted  that  the  aberrant  group  consists  of  all  200  spuri¬ 
ously  low  examinees;  no  spuriously  low  examinee  was  deleted  from  the  analysis 
due  to  an  insufficient  change  in  item  scoring.  In  fact,  42  examinees  had  seven 
or  fewer  item  responses  changed  (from  correct  to  incorrect  and  from  incorrect  to 
correct)  when  subjected  to  the  spuriously  low  modification. 


Figure 6

ROC Describing Detection of Spuriously Low Response
Vectors by L0 Index for Actual SAT-V Data

(Horizontal axis: Proportion of Normal Response Vectors Misclassified)


Discussion 

It might be argued that the ROCs in Figures 6 and 7 capitalize on
statistically overfitted data from the normal examinees. This argument rests on
the fact that the 2,800 examinees constituting the normal sample were included
in the norming sample, whereas item responses from the aberrant examinees were
included in the norming sample prior to the spuriously low modification. That
is, the post-tampering response vectors were not included in the norming sample
for the spuriously low group. It is suspected that the overfitting problem is
not serious for two reasons. First, as seen in Studies 1 and 2, overfitting did
not create serious difficulties for appropriateness measurement with spurious
lowness and for item parameters that are estimated in a large sample. Second,
the majority of the item responses made by spuriously low examinees did in fact
contribute to item parameter estimation: No more than 20 of the 85 items were
modified. Consequently, overfitting had very little, if any, effect on the
detection of aberrance in the actual data. However, it seemed wise to attack the
overfitting problem directly in Study 4 by separating the test norming and
index-evaluation classification samples.

Figure  7 

ROC  Describing  Detection  of  Spuriously  Low  Response 
Vectors  by  LR  Index  for  Actual  SAT-V  Data 


It is interesting to compare the effectiveness of the L0 and LR indices in
detecting aberrance. Clearly, in Figures 6 and 7 the L0 index is superior to
the LR index at high false alarm rates; but it is expected that there are many
more normal examinees than aberrant examinees. Hence, the base rate difference
makes the discrepancy between the L0 and LR ROCs at moderate to high false alarm
rates unimportant; index performance is crucial only at very low false alarm
rates. The LR index detected 19 aberrant examinees (out of 200) without
misclassifying a single normal examinee, whereas L0 detected only 6 without
misclassification. At a false alarm rate of .5%, LR detected 37 aberrant
examinees and L0 detected 32. Thus, at very low false alarm rates the LR index
seems somewhat more powerful.
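
The base rate argument can be made concrete with a short calculation. The sketch below uses Bayes' rule to find the proportion of flagged examinees who are actually aberrant. The hit and false alarm rates are taken from the Study 3 results above; the 1% base rate of aberrance is purely a hypothetical value chosen for illustration, and the function name is likewise hypothetical.

```python
def proportion_truly_aberrant(base_rate, hit_rate, false_alarm_rate):
    """Proportion of flagged examinees who are aberrant (Bayes' rule)."""
    flagged_aberrant = base_rate * hit_rate
    flagged_normal = (1.0 - base_rate) * false_alarm_rate
    return flagged_aberrant / (flagged_aberrant + flagged_normal)


BASE_RATE = 0.01  # hypothetical: 1 aberrant examinee per 100

# LR index at a .5% false alarm rate: 37 of 200 aberrants detected.
low_fa = proportion_truly_aberrant(BASE_RATE, 37 / 200, 0.005)
# LR index at a 4.3% false alarm rate: 40% of aberrants detected.
high_fa = proportion_truly_aberrant(BASE_RATE, 0.40, 0.043)

print(round(low_fa, 3), round(high_fa, 3))
```

Under this assumed base rate, roughly a quarter of the examinees flagged at the .5% false alarm rate would be aberrant, compared with fewer than one in ten at the 4.3% rate, which is why the low end of the ROC matters most.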


Study  4:  Actual  GRE-V  Data — 
Distinct  Norming  and  Classification  Samples 


Problem 


The  positive  results  in  Study  3  may  be  attributable  in  large  part  to  over¬ 
fitting  made  possible  by  overlapping  norming  and  classification  samples.  In 
this study the samples were independent. In addition, Study 4 investigated the
generality  of  the  appropriateness  indices.  The  indices  used  in  Studies  1 
through  3  were  selected  by  Levine  and  Rubin  (in  press)  from  a  larger  collection 
of  indices  because  of  their  superior  performance  in  experimental  studies  with 
actual  and  simulated  SAT-V  data.  The  question  is,  are  these  methods  applicable 
to  other  tests? 

Method 


The  responses  to  the  Verbal  Section  of  the  Graduate  Record  Examination 
(GRE-V)  by  10,000  examinees  were  utilized  in  the  following  manner.  First,  a 
file  of  3,000  examinees  (FILE1)  with  a  wide  range  of  ability  and  unrestricted 
omitting was created by selecting Examinees 1, 2, 3, 11, 12, 13, ..., 9991,
9992, 9993. This data set was then analyzed by LOGIST to obtain item parameter
estimates.  A  second  file  was  created  by  examining  the  item  responses  of  the 
remaining  7,000  examinees  and  selecting  examinees  with  a  low  omitting  rate.  A 
total  of  2,470  examinees  who  had  answered  at  least  86  of  the  95  GRE-V  items  was 
obtained.  Two  hundred  of  these  examinees  were  subjected  to  the  20%  spuriously 
low  modification.  These  modified  response  vectors  formed  the  aberrant  group  for 
Study  4,  and  the  remaining  2,270  response  vectors  formed  the  normal  group. 
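
The sample construction can be checked with a few lines of index arithmetic. The sketch below reproduces only the selection of examinee numbers for FILE1 and the remaining pool; the helper name is hypothetical, and the low-omitting-rate screen is not reproduced because it depends on the actual response data.

```python
def file1_ids(n_total=10_000, block=10, keep=3):
    """Examinee numbers 1, 2, 3, 11, 12, 13, ..., 9991, 9992, 9993:
    the first `keep` examinees in every block of `block`."""
    return [i for i in range(1, n_total + 1) if (i - 1) % block < keep]


norming_ids = file1_ids()
norming_set = set(norming_ids)
remaining_ids = [i for i in range(1, 10_001) if i not in norming_set]

print(len(norming_ids), norming_ids[:6], norming_ids[-3:], len(remaining_ids))
```

The selection yields 3,000 norming examinees and leaves 7,000 from which the low-omitting normal and aberrant groups were drawn, so the norming and classification samples cannot overlap.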

L0  was  computed  for  the  200  aberrant  and  2,270  normal  response  vectors  as 
in  Study  3  using  the  FILE1  item  parameter  estimates.  Notice  that  here  the  test 
norming  sample  was  distinct  from  the  normal  and  aberrant  samples. 

Results 


Figure 8 presents the ROC for the L0 index. Clearly, detection rates are
quite  high.  At  hit  rates  of  10%,  20%,  and  40%  there  are  .3%,  1.4%,  and  3.7% 
false  alarms.  These  detection  rates  were  not  substantially  different  from  those 
obtained  in  Study  3. 

Discussion 


The results of Study 4 are important for two reasons. First, the criticism
of overfitting the normal sample, which could be made for Studies 1 and 3, is
not relevant for Study 4. Because the test norming sample (FILE1) was composed
of examinees not included in the normal and aberrant groups, no differential
statistical overfitting of the data for the normal and aberrant groups was
possible. Consequently, the results of Study 4 can be interpreted unambiguously.

Figure 8

ROC Describing Detection of Spuriously Low Response
Vectors by L0 Index for Actual GRE-V Data


The finding that appropriateness measurement is effective for GRE-V data is
important for a second reason. Levine and Rubin (in press) originally
considered a variety of appropriateness indices. After examining the
effectiveness of each index using simulated SAT-V data, Levine and Rubin
selected the most effective indices for further study. The extent to which the
most effective indices were capitalizing on test characteristics unique to the
SAT-V was unknown. Study 4 shows that the methods developed using SAT-V data
were sufficiently general to be implemented for GRE-V data.


Summary  and  Conclusion 


These studies have shown that some appropriateness measurement techniques
are robust to errors in estimation of item parameters, to the inclusion of
unidentified aberrants in the test norming sample, and to violations of the
3-parameter logistic model, which surely must exist in actual data. The
detection rate of spuriously low examinees was high in all the analyses
undertaken.

Model Validity and Variable Ability Models

The 3-parameter logistic model, with its local independence and
unidimensionality assumptions, is admittedly simplistic. Lord (1975b) has
brought to the attention of the authors a very likely violation of local
independence on the SAT-V and GRE-V. Some paragraphs associated with several
items on the reading comprehension part of the examinations may be
misunderstood or, alternatively, may be relevant to an area in which the
examinee has an unusually strong background. Thus, it seems virtually certain
that responses to items referring to the same passage will be more highly
interrelated than the local independence assumption (Equation 1) predicts.
However, in spite of its shortcomings, the standard model has been able to
describe regularities in data well enough to support appropriateness
measurement. A more valid model, presumably, could support even more powerful
appropriateness indices.

Parenthetically, it is noted that violations of local independence can be
accommodated by variable ability generalizations of the standard latent trait
model. One variable ability model, the blocks model, is now being considered
for dealing with covarying blocks of items (such as those called to the
authors' attention by Lord, 1975b). In the blocks model a test is analyzed into
interrelated blocks of items. The examinee's ability on an item in a particular
block is his or her central ability, $\theta_0$, plus a (normally) distributed
correction. This correction is constant throughout the block of items. The
Gaussian model is a limiting special case in which each item forms a one-item
block. The standard model is another limiting case in which the entire test
forms one block. If an adequate item parameter estimation procedure were
developed for the blocks model, a substantial improvement in appropriateness
measurement could be achieved.
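
As an illustration only, the blocks model described above can be sketched as a
short simulation. The function names, the scaling constant D = 1.7, the item
parameter values, and the block layouts are assumptions introduced here; they
are not taken from the authors' work.

```python
import math
import random

D = 1.7  # conventional logistic scaling constant (an assumption of this sketch)


def p_correct(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_blocks(theta0, items, blocks, sigma, rng):
    """Simulate one response vector under a blocks-type model.

    items  : list of (a, b, c) tuples, one per item
    blocks : list of lists of item indices; each item belongs to one block
    sigma  : SD of the normally distributed block-level ability correction
    """
    responses = [0] * len(items)
    for block in blocks:
        delta = rng.gauss(0.0, sigma)  # one correction, shared by the whole block
        for g in block:
            a, b, c = items[g]
            p = p_correct(theta0 + delta, a, b, c)
            responses[g] = 1 if rng.random() <= p else 0
    return responses


rng = random.Random(1)
items = [(1.0, -1.0 + 0.5 * g, 0.2) for g in range(6)]  # hypothetical items

passage_blocks = [[0, 1, 2], [3, 4, 5]]     # e.g., two reading passages
gaussian_blocks = [[g] for g in range(6)]   # every item is its own block
standard_blocks = [list(range(6))]          # the entire test is one block

u = simulate_blocks(0.0, items, passage_blocks, sigma=0.5, rng=rng)
```

Only the block layout changes between the three cases, which is the sense in
which the Gaussian and standard models are limiting cases of the blocks model.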

The variable ability models are related to the independent work of Lumsden
(1978). Lumsden used "person characteristic curves" (PCCs) to describe
fluctuations in ability. The authors' work with the Gaussian model appears to
be a quantitative step in the direction Lumsden recommends for his purposes.

Item  Parameters  versus  Conditional  Probabilities 


In appropriateness measurement and in many other applications of latent
trait models, even very large standard errors of item estimates can sometimes be
tolerated. This is true because the probability measure determined by a latent
trait model depends directly on the conditional probability functions or item
characteristic curves (ICCs), $P_i$, and only indirectly on the item parameters.
The item parameters are simply a convenient way to encode the shape of ICCs. In
fact, some curves are adequately described by a broad range of parameters.
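
A small numerical sketch may make this point concrete. It assumes the usual
3-parameter logistic form with scaling constant D = 1.7, two hypothetical easy
items whose discrimination parameters differ by a factor of two, and a
hypothetical examinee group whose abilities lie between -1 and 3. None of these
values comes from the studies reported here.

```python
import math

D = 1.7  # conventional scaling constant (assumed)


def icc(theta, a, b, c):
    """3-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


item_1 = (1.0, -3.0, 0.0)  # (a, b, c), hypothetical
item_2 = (2.0, -3.0, 0.0)  # same item with a doubled discrimination parameter

# Ability range in which the (hypothetical) examinees are actually found.
thetas = [-1.0 + 0.1 * k for k in range(41)]

# Largest vertical distance between the two curves over that range.
max_gap = max(abs(icc(t, *item_1) - icc(t, *item_2)) for t in thetas)
print(round(max_gap, 3))
```

Over this range the two curves never differ by as much as .05 in probability,
even though a comparison of the parameters alone would suggest two quite
different items.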

This point is developed in a recent study of item bias (Linn, Levine,
Hastings, & Wardrop, 1979). In that study, item parameters were estimated for a
reading achievement test from two groups with widely different distributions of
achievement scores. A remarkable degree of invariance was observed when the
estimated curves from the two groups were compared. The two sets of estimated
curves were generally very nearly the same, although a superficial comparison of
parameters often showed large differences in the estimated item parameters.

Estimation of ICCs Having More Than One Parameter

In spite of several Monte Carlo studies and numerous successful
applications of the 3-parameter logistic model, there seems to be some doubt
about whether the 3-parameter ICCs can be estimated by currently used programs
or by any method whatsoever. Some psychometricians evidently believe that
adequate parameter estimates can be obtained only with the 1-parameter, or
Rasch, model (i.e., the specialization of the 3-parameter model obtained by
setting $a_i = 1$ and $c_i = 0$ in Equation 3).

In fact, parameter estimation techniques are available for models with much
more complex ICC shapes than that of the 3-parameter logistic model. Lord
(1970) and Samejima (1977) have formulated, programmed, and applied
nonparametric curve estimation techniques suitable for estimating curves of
arbitrary shape. Furthermore, Levine (1976) has proved a consistency result for
an estimation technique for estimating points on curves of arbitrary shape. It
seems that the estimation difficulties ensuing from the departure from very
simple curve shapes have been exaggerated.

Work  in  Progress 

The next step in the development of appropriateness measurement will be to
develop techniques for conventional tests in which there is substantial
omitting. The research in this paper with actual data has been restricted to
answer sheets with 90% or higher response rates. However, a substantial
proportion of the examinees have omitting rates greater than 10% on the SAT-V
and GRE-V.

It has been found that on conventional tests there is an orderly
relationship between omitting and ability on many items (Levine, Drasgow, &
Rubin, in prep.), and latent trait models for polychotomously scored items are
being developed to exploit this relationship. By treating "omitted" and "not
reached" as option choices, each answer sheet can be analyzed as if every item
were answered. The preliminary investigations indicate that in this way both
the range of applicability and the statistical power of appropriateness
measurement can be significantly increased.
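
The recoding step can be sketched as follows. The sketch assumes one common
convention, which the paper does not spell out: blanks after the last answered
item are "not reached" and earlier blanks are "omitted." The category codes "O"
and "N" and the function name are hypothetical.

```python
OMITTED = "O"        # hypothetical code for an omitted item
NOT_REACHED = "N"    # hypothetical code for an item that was not reached


def recode(raw):
    """Turn blanks (None) into explicit option categories.

    raw is a list with the chosen option letter for each item, or None
    when the item was left blank.
    """
    answered = [i for i, r in enumerate(raw) if r is not None]
    last = max(answered, default=-1)  # index of the last answered item
    coded = []
    for i, r in enumerate(raw):
        if r is not None:
            coded.append(r)
        elif i < last:
            coded.append(OMITTED)       # blank followed by a later answer
        else:
            coded.append(NOT_REACHED)   # trailing blank
    return coded


sheet = ["A", None, "C", "B", None, None]
coded = recode(sheet)
print(coded)  # ['A', 'O', 'C', 'B', 'N', 'N']
```

After recoding, every item carries a category, so a model for polychotomously
scored items can be applied to the whole answer sheet.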
Discussion: Session 7


James  Lumsden 

University  of  Western  Australia 


The three papers (by Mead, by Wainer and Wright, and by Levine and
Drasgow) agree that the task of estimating ability is a matter of estimating a
parameter of the person characteristic curve (PCC), which was first suggested by
Mosier (1940, 1941). It was rediscovered independently by Weiss (1973) and by
me (Lumsden, 1976), although in my case not independently, because I had read
the Mosier paper and had "forgotten" about it. The PCC is the plot of
proportion passed against item difficulty for a single individual.

Each of the papers assumes (Levine and Drasgow's not quite so completely)
that all we need do (in a proper world) is to estimate the location
parameters. All the PCCs "ought" to have the same slope; and departures from
the ideal represent aberrances, perturbations, warts on the face of the
estimate of a person's ability.

Mead has proposed to apply vanishing cream to the warts, at least to the
warts he sees. He has eliminated responses considered aberrant and has
estimated ability from those remaining. Wainer and Wright have made some
comments on the inefficiency of this pruning procedure and have suggested a
rather more suitable one. What I am concerned about is the suggestion that if
the Mead procedure (or the similar dropping of aberrant subjects) is carried
out, the Rasch model has been fitted. In no sense has this been done.

What about the possibility of a Type 2 error? Consider the following
simple, but quite plausible, example. Two people each know the answer to only
one item of a seven-item test. They are each somewhat lucky and guess the answer
to three others. They produce the following response vectors, with items ordered
in difficulty from left to right:

A   1 0 0 0 1 1 1
B   1 1 1 1 0 0 0

The Mead procedure would eliminate for Subject A the final three correct
answers and score only the first correct. For Subject B the score would be
estimated as if B knew the answer to four items. I submit that each of these
subjects equally fits or does not fit the Rasch model.
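
The contrast can be made concrete with a few lines of arithmetic. Under the
Rasch model the number-correct score is a sufficient statistic for ability, so
the two unedited vectors lead to the same ability estimate. The inversion count
below is only an illustrative stand-in for a pattern-based screen; it is not
Mead's procedure, and the function names are hypothetical.

```python
def number_correct(u):
    """Raw score: the number of items answered correctly."""
    return sum(u)


def inversions(u):
    """Count pairs in which an easier item is wrong and a harder item is right.

    Items are assumed to be ordered from easiest to hardest.
    """
    count = 0
    for i in range(len(u)):
        for j in range(i + 1, len(u)):
            if u[i] == 0 and u[j] == 1:
                count += 1
    return count


A = [1, 0, 0, 0, 1, 1, 1]
B = [1, 1, 1, 1, 0, 0, 0]

print(number_correct(A), number_correct(B))  # 4 4
print(inversions(A), inversions(B))          # 9 0
```

A pattern-based screen separates the two subjects sharply, although both
obtained their scores in the same way (one known answer and three lucky
guesses).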

The suggestion that a model can be fitted by removing the data that do not
fit it is nonsense, dangerous nonsense. Brown and Stephenson (1932) did this
when they claimed a fit for the Spearman two-factor theory. Wolfle (1940)
commented, "if you remove all the variables that do not meet the
tetrad-difference criterion, those that are left do meet it."

Wainer and Wright have proposed to shave their subjects with a jackknife
and thus clip the warts. The general proposal I find attractive. It gives less
weight to outliers but does not eliminate them: All the subjects and all the
results are considered. In this it is rather like the work of the
psychophysicists who use the Mueller-Urban weights.

Wainer and Wright's results should be taken with a square root of salt.
When someone talks of estimating a point value such as a person's ability, I
think of confidence limits and standard errors rather than variances. If the
square roots of all of Wainer and Wright's results are taken, the differences
between methods are seen as very much less; and in some cases, they can be
described as negligible. It should be noted, too, that the method of reporting
only efficiency ratios suppresses the main effect of test length. What is the
value of all this arithmetic? One item, or two or three?
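
The effect of taking square roots is easy to see with a few made-up numbers.
The variance ratios below are hypothetical and are not Wainer and Wright's
results; they only show how a ratio of variances shrinks when it is re-expressed
as a ratio of standard errors.

```python
import math

# Hypothetical efficiency ratios, expressed as ratios of error variances.
variance_ratios = [1.10, 1.50, 2.00]

# The same comparisons expressed as ratios of standard errors.
se_ratios = [math.sqrt(r) for r in variance_ratios]

for v, s in zip(variance_ratios, se_ratios):
    print(f"variance ratio {v:.2f} -> standard error ratio {s:.2f}")
```

A method that appears twice as efficient in variance terms has a standard error
only about 1.41 times smaller.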

I would also like to see the bias and precision estimates reported
separately. It should be noted that the efficiency comparisons are critically
dependent on the selection of the a value for the items. If the a value is
increased, the efficiency differences will be less.

Wainer and Wright's treatment of omitted responses is distressingly
foolish. They permitted subjects to omit items and did not punish them in any
way but simply scored the other items as if they comprised the entire test.
Now, if there are any person-item interactions, the person smart enough (or
lazy enough) to omit those items whose answer does not come to him/her
immediately will be given a higher score than the person who attempts all
questions. This is a seriously biasing part of Wainer and Wright's procedure.
One can usually be certain that a person who omits does not know the correct
answer. He/she should either be counted as wrong or should be given the chance
expectation of being correct.
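
The two scoring rules recommended here, and the rule being criticized, can be
compared in a short sketch. The answer key, the response vector, the number of
options per item, and the function names are all hypothetical.

```python
def score(responses, key, n_options, omit_policy="wrong"):
    """Score a test in which None marks an omitted item.

    omit_policy "wrong"  : an omit earns nothing
    omit_policy "chance" : an omit earns the chance expectation 1/n_options
    """
    total = 0.0
    for r, k in zip(responses, key):
        if r is None:
            if omit_policy == "chance":
                total += 1.0 / n_options
        elif r == k:
            total += 1.0
    return total


def proportion_of_attempted(responses, key):
    """The criticized rule: score only the attempted items."""
    attempted = [(r, k) for r, k in zip(responses, key) if r is not None]
    return sum(1 for r, k in attempted if r == k) / len(attempted)


key = ["A", "B", "C", "D", "A"]
resp = ["A", None, "C", None, "B"]  # two omits, two right, one wrong

print(score(resp, key, 4, "wrong") / len(key))    # 0.4
print(score(resp, key, 4, "chance") / len(key))   # 0.5
print(proportion_of_attempted(resp, key))         # 0.666...
```

Scoring only the attempted items gives this examinee the highest proportion
correct of the three rules, which is the bias described above.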

Both the Mead paper and the Wainer and Wright paper approach the problem of
fitting the Rasch model with what I term the psychoarithmetician's fallacy:
that arithmetic can be substituted for experimental control. Why not attempt to
meet the strict requirements of the Rasch model? One could begin by eliminating
the problem of guessing by using only completion-type items. If this means that
some tests have to be hand-scored, so much the better. One should then attempt
to construct a strictly unidimensional test, keeping the test construction
completely independent of the test scoring and application phases.

Levine and Drasgow used the 3-parameter model, also seeming to believe that
arithmetic is preferable to the "restrictions" of experimental control and
agreeing that perturbations of the PCC are aberrances, warts. Shuddering, they
dismissed the warty ones from consideration. Some of the things they chose to
call aberrant seem strange to me: If a test requires a subject to be able to
read and to understand English, in what way is it aberrant when it gives a low
score to someone who cannot read and understand English? The positive feature
of Levine and Drasgow's paper is that they do consider Type 2 errors and
that their ROC curve procedure is ingenious and generalizable to a variety of
other situations.

The assumption underlying all three papers, that the slopes of the PCC
"ought" to be the same, is simply that: an assumption. If it is agreed that it
is plausible that people do have fluctuations in their ability, then it seems
not implausible to assume that people may differ reliably and significantly in
the extent of this fluctuation. There is evidence, admittedly not overwhelming,
from Mosier (1940, 1941) and Weiss (1973) that they do. Further evidence should
be sought. If there are reliable differences in the slopes, it is difficult to
see how these differences can be distinguished from the aberrances that have so
distressed our participants.


Although latent trait models are potentially very useful, there remain many
practical problems at the application stage. For example, how should a latent
trait model be selected? It is tempting to use the more general models, since
these models will provide the "best" fits to the available test data.
Unfortunately, the more general latent trait models (for example, the
3-parameter logistic test model) require more computer time to obtain
satisfactory solutions, larger samples of examinees, and longer tests and are
more difficult for practitioners to work with. Clearly, more needs to be known
about the goodness of fit and robustness of latent trait models. Such
information would aid practitioners in the important step of selecting a test
model.

There has been some research on the goodness of fit of different latent
trait models to a variety of test data sets (e.g., Lord, 1975; Tinsley & Dawis,
1977; Wright, 1968), and generally the results have been good (Hambleton,
Swaminathan, Cook, Eignor, & Gifford, 1978). However, only one study compared
the fit of more than one latent trait model to the same data sets (Hambleton &
Traub, 1973). In that study, improvements were obtained in predicting test
score distributions (for three tests) from the 2-parameter model as compared to
the 1-parameter model.

On the question of model robustness (i.e., the extent to which the test
data can violate the assumptions underlying the test model and still be fitted
by the model), the results of several studies have been reported (Dinero &
Haertel, 1977; Hambleton, 1969; Hambleton & Traub, 1976; Panchapakesan, 1969).
These results have been mixed, perhaps because of the confounding of results
with sample sizes.

The  problem  with  most  of  the  goodness-of-fit  studies  and  the  robustness 
studies  reported  to  date  is  that  they  provide  no  indication  of  the  practical 
consequences  of  fitting  a  less  than  perfect  model  to  a  data  set.  It  really  is 
of  little  interest  to  the  practitioner  to  know  that  15  out  of  20  items  failed  to 
be  fitted  by  a  test  model  when  the  range  of  discrimination  parameters  reached  a 
value  of  .80.  If  the  size  of  the  examinee  sample  is  large  enough,  probably  all 
items  could  be  identified  by  a  chi-square  statistic  of  goodness  of  fit  as  not 
fitting  the  model.  On  the  other  hand,  if  the  size  of  the  examinee  sample  is 
small  enough,  perhaps  none  of  the  items  would  be  misfit  by  the  model.  Study  1 
addressed  this  question. 


One of the features of using any latent trait model is the possibility of
specifying a "target information curve" and then selecting test items from an
item pool to produce a test with the features characterized by that curve. A
target information curve describes the desired level of information at each
point on the ability scale underlying examinee test performance. Information,
in turn, is directly related to the degree of precision of ability estimates at
different points on the ability continuum. In fact, as long as a test is not
too short, the standard error of estimate at a particular ability level is equal
to 1 divided by the square root of information provided by the test at the
ability level in question. In practice, the contribution of each test item to
the test information curve (referred to as a "score information curve" when item
parameter estimates are used instead of the item parameter values) is known
once the item parameter values or the item parameter estimates are specified.
It is therefore possible to select test items from a pool of calibrated test
items (i.e., a pool of test items with associated parameter estimates) to
produce a score information curve that approximates a desired target
information curve.

One of the problems with the paradigm offered above for test development is
the imprecision associated with the item parameter estimates. Score information
curves, and therefore the associated standard errors of ability estimates, will
depend on the precision of item parameter estimates. In turn, precision of item
parameter estimates is influenced by the examinee sample size used to estimate
the item parameters, and in the case of the item discrimination parameter,
estimates are influenced by the length of the test. Study 2 was designed to
address this issue.


STUDY  1 

The purpose of Study 1 was to study systematically the goodness of fit of
the 1-, 2-, and 3-parameter logistic models. Using computer-simulated test
data, the effects of four variables were studied: (1) variation in item
discrimination parameters, (2) the average value of the pseudo-chance-level
parameters, (3) test length, and (4) the shape of the ability distribution.
Artificial or simulated data representing departures of varying degrees from
the assumptions of the 3-parameter logistic test model were generated, and the
goodness of fit of the three test models to the data was studied.

Method 


Simulating  the  Test  Data 


The simulation of item response data for examinees was accomplished using
the 3-parameter logistic model. First, the number of examinees (N), shape of
the ability distribution, and values of the ability parameters
($\theta_i$, i = 1, 2, ..., N) were specified. Next, the number of items in the
test (n) and values of the three item parameters ($a_g$, $b_g$, $c_g$;
g = 1, 2, ..., n) were specified. Then, the examinee and item parameters were
substituted in the equation of the 3-parameter logistic model to obtain
$p_{ij}$ ($0 < p_{ij} < 1$), representing the probability that examinee i
correctly answered item j. The probabilities were arranged in a matrix P of
order N x n whose (i, j) element was $p_{ij}$. P was then converted into a
matrix of the item scores for examinees (1 = correct answer, 0 = incorrect
answer) by comparing each $p_{ij}$ with a random number obtained from a uniform
distribution in the interval 0 to 1. If the random number was less than or
equal to $p_{ij}$ (which would happen on the average $p_{ij}$ of the time),
$p_{ij}$ was set equal to 1; otherwise, $p_{ij}$ was set to 0. The matrix P of
0's and 1's was the simulated test data. Three statistics used in estimating
examinee ability were calculated:

1-parameter score, $\sum_{g=1}^{n} u_g$, the number-correct score;

2-parameter score, $\sum_{g=1}^{n} a_g u_g$; and

3-parameter score, $\sum_{g=1}^{n} w_g(\theta) u_g$,

corresponding to statistics that are used in the estimation of examinee ability
with the 1-, 2-, and 3-parameter models, respectively. For the 3-parameter
model statistic, since the item weights [$w_g(\theta)$] depend on examinee
ability, 3-parameter model estimates of ability were estimated for each
examinee from LOGIST (Wood, Wingersky, & Lord, 1976).
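
The simulation steps above can be sketched as follows. This is a minimal
illustration and not the program used in the study: the scaling constant
D = 1.7, the random seed, and all function and variable names are assumptions,
and the 3-parameter score is left out because it requires ability-dependent
weights and ability estimates from a program such as LOGIST.

```python
import math
import random

D = 1.7  # conventional scaling constant (assumed)


def p3(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate(thetas, items, rng):
    """Return an N x n list of 0/1 item scores.

    Each probability is compared with a uniform random number on [0, 1);
    the score is 1 when the random number does not exceed the probability.
    """
    data = []
    for theta in thetas:
        row = []
        for a, b, c in items:
            p = p3(theta, a, b, c)
            row.append(1 if rng.random() <= p else 0)
        data.append(row)
    return data


rng = random.Random(0)
N, n = 500, 20
thetas = [rng.uniform(-2.5, 2.5) for _ in range(N)]   # uniform ability
items = [(rng.uniform(0.50, 1.74),                    # discrimination a
          rng.uniform(-2.0, 2.0),                     # difficulty b
          0.25)                                       # pseudo-chance level c
         for _ in range(n)]

u = simulate(thetas, items, rng)

# 1-parameter score: number correct.
number_correct = [sum(row) for row in u]
# 2-parameter score: item scores weighted by the discrimination parameters.
a_weighted = [sum(a * x for (a, _, _), x in zip(items, row)) for row in u]
```

The parameter ranges in the example follow the values described for the data
sets with the widest variation in discrimination and a pseudo-chance level of
.25.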

The  values  of  item  parameters  used  are  summarized  in  Table  1. 

Item  parameters.  Two  test  lengths  (20  and  40  items)  were  used  in  the  simu¬ 
lations.  Item  difficulty  parameters,  b,  were  selected  at  random  from  a  uniform 
distribution  in  the  interval  [-2,2].  An  analysis  of  the  difficulty  parameters 
reported  by  Lord  (1968)  suggested  that  this  decision  was  reasonable. 


The discrimination parameters, a, were selected at random from a uniform
distribution with mean = 1.12. The range of the discrimination parameters was a
variable under investigation. The range was varied from 0.0 to a maximum of
1.24 [.50 to 1.74], and an intermediate value of .62 [.81 to 1.43] was also
studied. The maximum value of discrimination was similar to the range and
distribution of the discrimination parameters reported for the Verbal Section
of the Scholastic Aptitude Test (SAT-V; Lord, 1968).

The extent of guessing in the simulated test data was another variable
under study. Two values of the average guessing parameter were considered:
c = 0.0 and c = .25. All pseudo-chance-level parameters were set equal to the
mean value of the c parameter under investigation.


Examinee parameters. The number of examinees was set to 500. This number
was sufficient to produce stable goodness-of-fit results. Two distributions of
ability were considered: Uniform [-2.5, 2.5] and Normal [0, 1].


Table 1

Test Lengths, Range of Discrimination Parameters, and
Pseudo-Chance-Level Parameters for Each Data Set

Data    Test      Variation in Discrimination    Pseudo-Chance-Level
Set     Length    Parameters                     Parameters

A       20        0.00                           .00
B       20        0.00                           .25
C       20        .81 to 1.43                    .00
D       20        .81 to 1.43                    .25
E       20        .50 to 1.74                    .00
F       20        .50 to 1.74                    .25
G       40        0.00                           .00
H       40        0.00                           .25
I       40        .81 to 1.43                    .00
J       40        .81 to 1.43                    .25
K       40        .50 to 1.74                    .00
L       40        .50 to 1.74                    .25


Goodness  of  Fit 


For each data set A through L (2 test lengths x 2 levels of
pseudo-chance-level parameters x 3 levels of variation in discrimination
parameters) and for each of the two ability distributions (Uniform and Normal),
three scoring methods were used to estimate ability based on the 1-, 2-, and
3-parameter models. Since simulated data were used, it was possible to "know"
examinee ability scores, which served as the criterion against which to judge
the statistics derived from the three test models for ranking examinees. The
rankings of examinees derived from each model (for each set of test data) were
then compared to examinee "true" abilities using Spearman rank-difference
correlations and the average discrepancy in ranks. Because of the arbitrariness
of the scale on which $\theta$ is measured, summary statistics such as
$\sum_{i=1}^{N} |\theta_i - \hat{\theta}_i| / N$ were not studied. To further
facilitate the interpretation of results, they are reported separately for
each half of the ability distribution as well as for the total ability
distribution.
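
The two criterion measures can be sketched as follows. The treatment of tied
scores (average ranks, with the correlation computed on the ranks) is an
assumption, since the paper does not say how ties in number-correct scores were
handled; the function names and the four-examinee example are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    vx = sum((p - mx) ** 2 for p in rx)
    vy = sum((q - my) ** 2 for q in ry)
    return cov / math.sqrt(vx * vy)


def average_abs_rank_difference(x, y):
    """Average absolute difference between the two sets of ranks (AAD)."""
    rx, ry = ranks(x), ranks(y)
    return sum(abs(p - q) for p, q in zip(rx, ry)) / len(x)


true_theta = [-1.2, -0.3, 0.4, 1.5]   # hypothetical "true" abilities
model_score = [5, 9, 8, 14]           # hypothetical scores from one model

r = spearman(true_theta, model_score)
aad = average_abs_rank_difference(true_theta, model_score)
print(round(r, 2), aad)  # 0.8 0.5
```

Both measures depend only on the ordering of examinees, which is why they are
unaffected by the arbitrary scale on which ability is expressed.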


Results 

Results  are  summarized  in  Tables  2  through  5. 

Level  of  Variation  in  Discrimination  Parameters 

For the values studied in the paper, using discrimination parameters as
item weights contributed very little to the proper ranking of examinees.

Table 2

Spearman Rank-Order Correlations (r) and
Average Absolute Difference in Rank Orders (AAD) for
the Two Halves of the Uniform Ability Distribution

Data    True vs. 1-P Model    True vs. 2-P Model    True vs. 3-P Model
Set       r       AAD           r       AAD           r       AAD

Lower Half ($\theta$ = -2.5 to 0.0)
A        .88     54.24         .88     54.24         .88     54.24
B        .77     76.61         .77     76.61         .83     64.98
C        .88     56.07         .88     56.41         .88     56.40
D        .76     77.14         .76     76.90         .83     64.28
E        .87     56.50         .87     56.56         .87     56.56
F        .75     80.08         .75     79.92         .83     65.77
G        .94     36.48         .94     36.48         .94     36.48
H        .87     58.58         .87     58.58         .91     48.70
I        .95     36.50         .95     36.47         .95     36.47
J        .87     57.66         .88     56.86         .91     48.01
K        .94     37.86         .95     36.96         .95     36.74
L        .87     57.82         .88     56.87         .91     48.22

Upper Half ($\theta$ = 0.0 to +2.5)
A        .88     54.45         .88     55.62         .88     55.62
B        .84     63.68         .83     65.35         .83     65.73
C        .89     52.23         .88     55.38         .88     55.38
D        .88     63.80         .83     65.02         .84     63.19
E        .87     56.99         .88     55.38         .88     55.47
F        .80     71.57         .80     70.72         .80     69.16
G        .94     39.03         .94     40.50         .94     40.50
H        .90     50.19         .90     51.05         .90     50.85
I        .94     40.65         .93     41.83         .93     41.85
J        .91     49.14         .90     50.55         .91     50.27
K        .93     40.79         .94     38.93         .94     38.94
L        .89     52.88         .89     52.90         .89     52.68


Level  of  Pseudo-Chance-Level  Parameters 


With the 20-item tests the 3-parameter model was considerably more
effective at ranking examinees correctly in the lower half of the ability
distribution. Correlations were about .08 higher (about .75 to .83) in the
uniform distribution of ability and about .08 higher in the normal distribution
(about .65 to .73). The improvement in the average absolute difference in rank
order was about 13.


With the 40-item tests, the 3-parameter model was also somewhat more
effective at ranking examinees correctly in the lower half of the ability
distribution. Correlations were about .04 higher in both ability distributions.
The improvement in the average absolute difference in rank order was about 8.
The reduction in effectiveness of the 3-parameter model weights was to be
expected with the longer tests. Gulliksen (1950) noted the insignificance of
scoring weights when the test gets longer and test items are positively
correlated.

Table 3

Spearman Rank-Order Correlations (r) and
Average Absolute Difference in Rank Order (AAD) for
the Full Uniform Ability Distribution ($\theta$ = -2.5 to +2.5)

Data    True vs. 1-P Model    True vs. 2-P Model    True vs. 3-P Model
Set       r       AAD           r       AAD           r       AAD

A        .97     28.26         .97     28.37         .97     28.37
B        .93     41.85         .93     41.97         .95     36.97
C        .97     28.81         .97     29.14         .97     29.14
D        .93     42.40         .93     43.93         .94     38.59
E        .97     30.83         .97     30.14         .97     30.14
F        .93     42.20         .93     42.73         .94     39.02
G        .98     20.44         .98     20.61         .98     20.61
H        .96     30.13         .96     30.26         .97     27.02
I        .98     21.09         .98     21.25         .98     21.25
J        .96     30.69         .96     30.75         .97     27.74
K        .98     22.48         .98     21.81         .98     21.81
L        .96     31.49         .96     30.50         .97     27.30


For  examinees  in  the  upper  half  of  the  ability  distribution,  and  for  the 
data  sets  studied,  the  number-correct  score  was  about  as  effective  as  the  more 
complicated  scoring  weights  used  in  the  2-  and  3-parameter  models. 

Shape  of  the  Ability  Distribution 


As  expected,  correlations  tended  to  be  higher  for  the  uniformly  distributed 
ability  scores. 

Test  Length 


Increases  in  correlations  were  observed  due  to  doubling  the  length  of  the 
test.  Again,  as  expected,  they  tended  to  be  rather  small. 

Conclusions 


From the data in this study, it is clear that there are some sizable gains
to be expected in the correct ordering of examinees at the lower end of the
ability continuum with modest length tests (n = 20) when 3-parameter model
estimates are used (as opposed to the number-correct score). The gains were cut
roughly in half when the tests were doubled in length (n = 40). It was also
surprising that item discrimination parameters as weights had so little effect
on the results. However, Gulliksen (1950) summarized the research on item
weights nearly 30 years ago and came to essentially the same conclusion.
Consequently, to the extent that these simulated data sets are typical of real data,
it would appear that the application of latent trait models to the problem of
ranking examinees is probably not worth the trouble except in those situations
where gains of the size noted for lower ability examinees are important. The
number-correct score ranks examinees nearly as well as the most complicated
scoring methods.
Table 4

Spearman Rank-Order Correlations (r) and
Average Absolute Difference in Rank Order (AAD)
for the Two Halves of the Normal Ability Distribution

Data    True vs. 1-P Model    True vs. 2-P Model    True vs. 3-P Model
Set        r       AAD           r       AAD           r       AAD

Lower Half (mean θ = 0.0, SD θ = 1.0)
A         .82     65.58         .82     65.58         .82     65.58
B         .65     94.93         .65     94.93         .74     82.54
C         .84     62.72         .83     63.26         .83     63.31
D         .65     95.18         .65     95.77         .73     83.49
E         .80     70.65         .80     69.43         .80     69.41
F         .66     94.63         .64     95.80         .73     83.38
G         .91     46.03         .91     46.03         .91     46.03
H         .81     68.70         .81     68.70         .85     61.63
I         .90     48.23         .91     47.28         .91     47.28
J         .81     68.08         .82     67.05         .85     60.09
K         .90     48.22         .91     46.58         .91     46.58
L         .81     69.01         .81     68.66         .85     61.58

Upper Half (mean θ = 0.0, SD θ = 1.0)
A         .84     60.51         .84     60.81         .84     60.81
B         .76     75.75         .76     76.16         .77     75.08
C         .85     61.09         .85     61.60         .85     61.61
D         .76     76.41         .76     78.02         .77     75.63
E         .83     64.79         .85     63.08         .85     63.08
F         .75     78.69         .75     79.92         .77     77.01
G         .90     50.71         .90     50.75         .90     50.75
H         .82     65.18         .82     65.45         .83     64.24
I         .89     51.25         .90     50.21         .90     50.23
J         .82     65.92         .83     64.84         .84     63.16
K         .89     51.01         .90     49.95         .90     49.95
L         .81     67.60         .82     64.51         .83     63.96

The results of this single study should be generalized with caution, since
the values of the item parameters used may not be typical of real data sets.
Secondly, the criterion measure of goodness of fit seems suitable for the
situation in which a user desires to make norm-referenced interpretations of test
scores. There are many other test situations (for example, those involving
adaptive tests, test score equating, and criterion-referenced tests) where a
different criterion to judge the quality of a solution would be more suitable.
Thirdly, these results provide a somewhat unfair comparison of the 2-parameter
model with the other two models because the item discrimination parameters used
in the weighting process to derive statistics for ability estimation would have
been somewhat different had the "best-fitting" 2-parameter curves to the
3-parameter item characteristic curves been used. The item discrimination
parameters in the best-fitting 2-parameter curves would have differed somewhat from
those defined in the 3-parameter curves to which they were fitted. Finally, the
correlation results of the 1-parameter model and, to a much lesser extent, the
2-parameter model are inflated (to an unknown extent) because of tied scores.
Therefore, the true differences in the reported correlations are somewhat larger
than those reported in Tables 1 to 5.

Table 5

Spearman Rank-Order Correlations (r) and
Average Absolute Difference in Rank Order (AAD) for
the Full Normal Ability Distribution (mean θ = 0.0, SD θ = 1.0)

Data    True vs. 1-P Model    True vs. 2-P Model    True vs. 3-P Model
Set        r       AAD           r       AAD           r       AAD

A         .94     36.84         .94     36.91         .94     36.91
B         .88     53.94         .88     53.90         .91     47.55
C         .94     35.87         .94     35.99         .94     35.98
D         .88     54.31         .88     54.34         .91     48.61
E         .93     41.11         .93     40.96         .93     40.96
F         .87     55.73         .87     57.94         .88     53.13
G         .97     26.60         .97     26.62         .97     26.62
H         .95     36.44         .95     36.46         .96     33.03
I         .97     25.20         .97     25.54         .97     25.53
J         .94     38.86         .94     37.65         .95     34.15
K         .97     27.04         .97     25.88         .97     25.87
L         .94     38.79         .94     37.33         .95     34.68
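
The two criterion statistics reported in the tables for this study can be sketched in a few lines. The sketch below assumes that AAD is the mean absolute difference between an examinee's rank on true ability and rank on the estimate, and that tied scores share the average of the ranks they occupy; both are assumptions of this illustration, and the function names `average_ranks` and `spearman_and_aad` are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_and_aad(true_scores, estimates):
    """Return (Spearman r, average absolute difference in rank order)."""
    rank_true = average_ranks(true_scores)
    rank_est = average_ranks(estimates)
    n = len(rank_true)
    mean_true = sum(rank_true) / n
    mean_est = sum(rank_est) / n
    cov = sum((a - mean_true) * (b - mean_est) for a, b in zip(rank_true, rank_est))
    var_true = sum((a - mean_true) ** 2 for a in rank_true)
    var_est = sum((b - mean_est) ** 2 for b in rank_est)
    r = cov / (var_true * var_est) ** 0.5
    aad = sum(abs(a - b) for a, b in zip(rank_true, rank_est)) / n
    return r, aad


# Made-up example: five true abilities and number-correct scores with one tie.
r, aad = spearman_and_aad([-1.2, -0.4, 0.1, 0.9, 1.5], [3, 5, 5, 8, 9])
```

Because number-correct scores take few distinct values, many examinees tie, which is the situation the preceding paragraph flags for the 1-parameter model.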


STUDY  2 

This  study  was  designed  to  investigate  two  practical  questions  that  are  of 
some  importance  and  interest  to  test  developers: 

1.  What  are  the  effects  of  examinee  sample  size  and  test  length  on  the 
standard  errors  of  ability  estimation  curves? 

2.  What  effects  do  the  statistical  characteristics  of  an  item  pool  have  on 
the  precision  of  standard  errors  of  ability  estimation  curves? 

Method 


Variables 


Tests of three lengths were considered: 10, 20, and 80 items. The 10-item
length was selected because it is about as short a test as is used in practice,
and the 80-item length because it is about as long a test as is used in
practice.

Ability scores were simulated to be normally distributed (mean = 0, SD = 1).
This assumption was made to conform with an assumption made in Urry's
(1974) item parameter estimation method, which was used (with slight
modifications) in this study.


Three examinee sample sizes were used: 50, 200, and 1,000. The smallest
sample size (N = 50) is considerably smaller than should be used in practice.
It was chosen to identify the worst possible results that could be expected.
The other two sample sizes define minimum and maximum sample sizes typically
used in test development work with latent trait models. Ranges of parameter
values for items in the two pools are shown in Table 6. As Table 6 shows, items
in Item Pool 1 had a wider range of difficulty and discrimination values than
those in Item Pool 2.


Table 6

Range of Item Parameter Values for the
Two Simulated Item Pools

                            Range of Values
Item Parameter         Item Pool 1        Item Pool 2

Difficulty (b)        -2.00 to 2.00      -1.00 to 1.00
Discrimination (a)      .60 to 2.00        .60 to 1.50
Pseudo-Chance (c)       .25 to  .25        .25 to  .25

Data  Simulation 


The  eight  steps  in  the  data  simulation  were  as  follows: 

1.  Item  Pool  1  was  selected  for  study. 

2.  A test length (10, 20, or 80 items) and a sample size (50, 200, or
1,000 examinees) were selected. A sample of examinee ability scores
was drawn from a normal distribution (mean = 0, SD = 1).

3.  Computer program DATAGEN (Hambleton & Rovinelli, 1973) produced (1)
item parameters, given the constraints of the item pool under
investigation, and (2) examinee item scores. The computer program used the
3-parameter logistic model, the ability scores from Step 2, and item
parameters generated at this step to produce probabilities of correct
answers for examinees to the test items. These probabilities, in turn,
were converted to examinee item scores (0 or 1) by a random number
generator.

4.  The examinee item scores from Step 3 were used in Urry's computer
program to estimate item and ability parameters. However, only the item
parameter estimates were used further in this particular study.

5.  The item parameter estimates were used to obtain the standard errors of
estimate for estimating θ [SEE(θ)]. The values of SEE(θ) at seven
ability levels (θ = -3.00, -2.00, -1.00, 0.00, 1.00, 2.00, 3.00) were
calculated.

6.  Steps 3 to 5 were repeated three times to obtain three estimates of
SEE(θ). All item and ability parameter values for the three runs were
identical. The particular examinee item scores varied from one run to
the next because of the probabilistic nature of the score outcomes.

7.  Steps 3 to 6 were repeated for each combination of test length and
sample size (3 × 3 = 9).

8.  Steps  2  to  7  were  repeated  with  Item  Pool  2.  In  all,  54  sets  of  test 
data  were  considered  in  the  study. 
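
The core of Steps 2, 3, and 5 can be illustrated with a short sketch. It is not DATAGEN or Urry's program: it assumes the usual 3-parameter logistic form with scaling constant D = 1.7, draws item parameters uniformly within the Item Pool 1 ranges of Table 6, and computes SEE(θ) as the reciprocal square root of the test information from the generating item parameters rather than from estimated ones. All function names are hypothetical.

```python
import math
import random

D = 1.7  # logistic scaling constant (an assumption of this sketch)


def p_correct(theta, a, b, c):
    """3-parameter logistic probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_scores(thetas, items, rng):
    """Convert probabilities to 0/1 item scores with uniform random numbers."""
    return [[1 if rng.random() < p_correct(t, a, b, c) else 0
             for (a, b, c) in items] for t in thetas]


def see(theta, items):
    """Standard error of estimate: 1 / sqrt(test information) at theta."""
    info = 0.0
    for a, b, c in items:
        p = p_correct(theta, a, b, c)
        info += (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2
    return 1.0 / math.sqrt(info)


rng = random.Random(0)
# Hypothetical 20-item test within the Item Pool 1 ranges (a, b, c).
items = [(rng.uniform(0.6, 2.0), rng.uniform(-2.0, 2.0), 0.25) for _ in range(20)]
thetas = [rng.gauss(0.0, 1.0) for _ in range(200)]  # Step 2: N = 200
scores = simulate_scores(thetas, items, rng)        # Step 3: 0/1 item scores
see_curve = {t: see(t, items) for t in (-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0)}
```

In the study itself the item parameters entering SEE(θ) were estimates obtained from the simulated scores, which is why the curves vary across replications; the sketch omits that estimation step.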

Results 


Tables 7 to 9 contain the SEE curves from Item Pool 1 obtained for three
replications of three examinee sample sizes (N = 50, 200, and 1,000) and three
test lengths (n = 10, 20, and 80) and for seven ability levels. Test lengths
and sample sizes given under the column headed "Actual" are the numbers of items
and examinees remaining after a satisfactory set of item and ability parameter
estimates was obtained from Urry's computer program.

Effect  of  Sample  Size 


The data for a test length of 10 items, shown in Table 7, clearly show the
lack of stability of the SEE curves for all sample sizes. There was little
improvement, if any, due to increasing sample size. This result, however, may be
due to the limited amount of data considered, since improvements were obtained
in Item Pool 2 and at other test lengths.

Table 7

Standard Error Estimates (SEE) Adjusted to Correspond
to 10-Item Tests for Various Sample Sizes and
Ability Levels with a Heterogeneous Item Pool

Sample Size    Actual   Actual                   Ability Level
and            Test     Sample
Replication    Length   Size     -3.0   -2.0   -1.0    0.0    1.0    2.0    3.0

N = 50
  1              10       34      .66    .33    .67    .22    .75   1.60   2.19
  2              10       34     2.40   1.88    .56   1.04    .20   1.34   1.37
  3               9       34      .73    .57   1.03    .22    .58    .43   2.19
N = 200
  1              10      172      .64    .21    .52   2.15   1.60   1.50   1.48
  2              10      137      .22    .51    .36   1.30    .37    .96   2.45
  3              10      174     2.63   2.14    .27   2.75    .92    .76   1.91
N = 1000
  1              10      841      .98    .26    .58   1.43   3.33    .57   1.18
  2              10      833     1.03   1.03    .67   1.05    .45   1.01   1.06
  3              10      892     2.44    .49    .67    .30    .29    .89   1.33

Table 8 contains the results for 20-item test lengths and shows that the
SEE curves were beginning to stabilize. Except at extreme values of the ability
continuum, the results for the smaller sample sizes were nearly as good as those
obtained with the larger sample size (N = 1,000).


Table 8

Standard Error Estimates (SEE) for Various Sample Sizes
and Ability Levels with a Heterogeneous Item Pool

Sample Size    Actual   Actual                   Ability Level
and            Test     Sample
Replication    Length   Size     -3.0   -2.0   -1.0    0.0    1.0    2.0    3.0

N = 50
  1              20       50     2.84    .70    .35    .30    .31    .44   1.23
  2              20       50     1.93   1.53    .39    .32    .24    .45   1.19
  3              20       46     2.07    .83    .58    .31    .36    .68   1.48
N = 200
  1              20      193      --     .57    .26    .39    .33    .50    .77
  2              20      196      --    1.51    .37    .34    .25    .53    .86
  3              20      196      --    1.03    .22    .49    .34    .40   1.15
N = 1000
  1              20      955      --    1.05    .48    .33    .33    .45    .82
  2              20      969      --    1.18    .37    .33    .37    .40    .99
  3              20      968      --    1.56    .40    .42    .32    .43   1.07

At a test length of 80 items, the SEE curves were highly stable, as clearly
shown in Table 9. Similar to the effect noted with test lengths of 20, the
expected decrease in variation of the SEE with increase in sample size was
apparent only at ability levels of -1, +1, and +2.

Effect  of  Test  Length 


Examination of the results reported in Tables 7 through 9 indicates that for
samples of size 50, as test length increased, variation in the SEE curves
decreased at all ability levels. Results of the simulations for sample sizes of
200 and 1,000 clearly show the following trends:

The  most  stable  SEE  curves  were  obtained  for  the  longest  test  length; 
and 

For  all  ability  levels,  variation  in  the  SEE  curves  decreased  as  test 
length  increased. 


Figure 1 illustrates the effect of test length and sample size on the
stability of the SEE curves at five ability levels for Item Pool 1. Each graph
represents a plot of the values of the SEE curves obtained when sample size was
held constant and test length was varied. It is clear, from examination of
these graphs, that sample size has little effect on the stability of SEE curves
of the 10-item tests. The effect of sample size on the stability of the SEEs
was most apparent for the 20-item tests. For the 80-item tests, sample size
showed the most pronounced effect when there was an increase from 50 examinees
(Figure 1a) to 200 examinees (Figure 1b). An effect was also noticed when
sample size was increased from 200 examinees (Figure 1b) to 1,000 examinees
(Figure 1c); however, the improvements in precision were more modest in size.

Table 9

Standard Error Estimates (SEE) Adjusted to Correspond
to 80-Item Tests for Various Sample Sizes
and Ability Levels with a Heterogeneous Item Pool

Sample Size    Actual   Actual                   Ability Level
and            Test     Sample
Replication    Length   Size     -3.0   -2.0   -1.0    0.0    1.0    2.0    3.0

N = 50
  1              74       50     1.10    .35    .14    .14    .24    .24    .45
  2              79       50     1.06    .48    .25    .17    .13    .32    .49
  3              77       50      .93    .20    .19    .15    .17    .29    .48
N = 200
  1              80      200      .89    .26    .22    .24    .19    .25    .44
  2              80      200      .62    .29    .25    .19    .21    .25    .46
  3              80      200     1.06    .35    .21    .19    .20    .25    .48
N = 1000
  1              80      999     1.00    .35    .23    .21    .21    .24    .40
  2              80     1000      .98    .32    .23    .22    .21    .23    .43
  3              80     1000     1.08    .34    .20    .21    .20    .24    .46

Table  10  summarizes  the  data  reported  in  Tables  7  through  9  and  includes 
summary  data  for  Item  Pool  2.  Entries  in  this  table  are  the  standard  deviations 
of  the  SEEs  obtained  across  the  three  replications  of  the  various  studies. 
Standard  deviations  are  reported  for  each  test  length-sample  size  combination 
across  five  ability  levels.  Also  included  in  Table  10  is  the  average  of  the 
standard  deviations  across  ability  levels  for  each  combination  of  test  length 
and  sample  size. 
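
As a worked illustration of how a Table 10 entry relates to the replication data, the sketch below takes the three replicated SEE values reported in Table 9 for 80-item tests with N = 50 and computes their standard deviation at each of the five ability levels. Dividing by the number of replications (the population form, `statistics.pstdev`) reproduces the corresponding Pool 1 entries of Table 10 (.11, .04, .01, .05, .03); this is an observation about the published numbers, not a statement of the authors' documented formula.

```python
from statistics import pstdev

# SEE values from Table 9 (80-item tests, N = 50, Item Pool 1):
# one list per ability level, one entry per replication.
see_by_ability = {
    -2.0: [0.35, 0.48, 0.20],
    -1.0: [0.14, 0.25, 0.19],
    0.0: [0.14, 0.17, 0.15],
    1.0: [0.24, 0.13, 0.17],
    2.0: [0.24, 0.32, 0.29],
}

# Standard deviation across the three replications, rounded as in Table 10.
sd_by_ability = {theta: round(pstdev(values), 2)
                 for theta, values in see_by_ability.items()}
```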

Several trends are apparent from examination of the average variation of
the SEEs for Item Pool 1: (1) the variation decreased as test length increased
for all sample sizes; (2) when test length was fixed at 10 items, sample size
had little or no effect on the stability of the SEE curves; and (3) sample size,
generally, had a noticeable effect on the stability of the SEE curves.

Examination  of  the  average  variation  across  ability  levels  for  Item  Pool  2 
indicated  that  for  all  test  lengths,  sample  size  has  a  noticeable  effect  on  the 
stability  of  SEE  curves.  In  comparison  to  the  results  reported  for  Item  Pool  1, 
the  effect  of  test  length  on  the  average  variation  across  ability  levels  was  not 
so  apparent.  The  reason  for  this  is  the  smaller  variation  observed  for  short 
tests  with  this  particular  item  pool. 


The results in Table 10 indicate that for tests of 20 and 80 items, the
variation in the SEE curves, averaged across ability levels, was very similar
for both item pools. For test lengths of 10, the situation was quite different.
In order to make the average variation across ability levels at this test length
comparable for both item pools, these values were recomputed for Item Pool 1
excluding the values obtained for ability level of -2. The recomputed average
variation values were .33, .38, and .52 for sample sizes of 50, 200, and 1,000,
respectively. It is clear that for short tests, the homogeneous item pool (Item
Pool 2) resulted in smaller average variations than did the heterogeneous item
pool. A second point worth noting is that the heterogeneous item pool (Item
Pool 1) provided more stable SEEs at an ability of -2 for test lengths of 10 or
20 items than did the homogeneous item pool. For test lengths of 80, the
results appear to be about the same for both item pools.

Figure 1

Standard Errors of Estimate (SEE) for Three Test Lengths (10, 20,
and 80 Test Items), Five Ability Levels, and
Three Sample Sizes (50, 200, and 1000 Examinees)

(Each Combination of Conditions Was Replicated Three Times)

[Panels: (a) 50 Examinees, (b) 200 Examinees, (c) 1000 Examinees.
Horizontal axis: Ability Level. Legend: 10-item test, 20-item test,
80-item test.]

Conclusions 


This  study  has  provided  data  concerning  the  size  of  improvements  in  SEE 
curves  relative  to  the  three  factors  under  investigation:  (1)  sample  size,  (2) 
test  length,  and  (3)  item  pool  characteristics.  Several  conclusions  appear  to 
be  warranted: 


Table 10

Standard Deviations of Standard Errors of Estimates
Across Three Replications at Several Ability Levels
for Different Test Lengths and Examinee Sample Sizes,
and for the Heterogeneous Item Pool (Pool 1)
and the Homogeneous Item Pool (Pool 2)

                                                                  Average
                                                                  Variation
                                                                  Across
Test     Sample   Item               Ability Level               Ability
Length   Size     Pool      -2.0   -1.0    0.0    1.0    2.0     Levels

10         50     Pool 1     .68    .20    .39    .23    .50       .40
                  Pool 2     --     .17    .11    .41    .28       .24
          200     Pool 1     .85    .10    .60    .50    .31       .47
                  Pool 2     --     .03    .07    .03    .22       .09
         1000     Pool 1     .32    .04    .47   1.40    .19       .60
                  Pool 2     --     .07    .03    .04    .03       .04

20         50     Pool 1     .36    .10    .01    .05    .11       .16
                  Pool 2     .78    .07    .10    .05    .08       .22
          200     Pool 1     .38    .06    .06    .04    .06       .12
                  Pool 2     .37    .00    .02    .04    .00       .09
         1000     Pool 1     .22    .05    .04    .02    .02       .09
                  Pool 2     .50    .03    .01    .00    .02       .11

80         50     Pool 1     .11    .04    .01    .05    .03       .06
                  Pool 2     .16    .04    .01    .02    .04       .05
          200     Pool 1     .04    .02    .02    .01    .00       .02
                  Pool 2     .03    .01    .01    .01    .01       .01
         1000     Pool 1     .01    .01    .00    .00    .00       .00
                  Pool 2     .02    .00    .00    .00    .01       .01

1.  Both  test  length  and  sample  size  are  extremely  important  factors  in  the 
precision  of  SEE  curves.  The  small  number  of  reversals  in  the  results 
was  no  doubt  due  to  sampling  fluctuations. 

2.  At the extremes of an ability continuum, precision of SEE curves is very
poor, even with large examinee sample sizes. The results are
substantially better when tests are lengthened, even if the sample size is
small (N = 50).

3.  The precision of SEE curves would be acceptable in most instances if
the curves are based on 200 or more examinees with test lengths of at
least 20 items. This recommendation holds if primary concern is with
values of the curves in middle regions of the ability continuum [-1 to
+1].

4.  Increases in examinee sample sizes from 50 to 200 produce sizeable
improvements in the precision of SEE curves; however, increasing a
sample size from 200 to 1,000 produces only modest further gains in
precision.

5.  Similarly for test lengths, improvements in precision were
substantially better when the change was from 10 to 20 items than from 20 to 80
items.

The results of this study suggest that if an item pool is typical, the
stability of SEE curves across readministrations of the test to similar groups of
examinees will be quite good if the test includes at least 20 items and if 200
or more examinees are used in deriving the item statistics.
Distribution of Ability

Estimation of the parameters in the 2-parameter logistic latent trait model
will be discussed within the framework of the estimation procedure developed by
Bock and employed in the LOGOG computer program (Kolakowski & Bock, 1973). This
method of estimation requires the assumption of some prior distribution of
abilities during estimation of the item parameters (although no distributional
assumption is required during estimation of the ability parameters). Typically, a
normal prior is assumed during item parameter estimation. The questions to be
explored in this monte carlo study are (1) What effect does this method of
estimation have on the estimated abilities when the true distribution of abilities
is nonsymmetric? and (2) Since the entire procedure is defined only to within a
linear transformation, does there exist a linear function of the data that will
improve the accuracy of the estimated abilities in this situation? The monte
carlo simulation presented here reveals a plausible and simple candidate: a
linear transformation using the means of the item difficulties and item
discriminations. However, theoretical support in closed form for this solution is still
forthcoming.

The motivation for seeking some function of the estimated item parameters
to adjust the estimated abilities stems from the following well-known fact
regarding the value of the discrimination parameter in the 1-parameter logistic
model, commonly known as the Rasch model. The Rasch model is typically written
as in Equation 1.
Alternatively, it may be written as in Equations 2 and 3.
Examination of Equations 2 and 3 makes it explicit that the 1-parameter Rasch
model may in fact have item discrimination parameters that are all equal to some
constant value, say a. The value of the constant will in most cases be unknown,
as it will be considered in the variance of the distribution of the estimated
abilities. Since this unknown item parameter may affect the distribution of the
abilities, it is possible that unknown parameters of the distribution of
abilities may affect the item parameters in a discernible way.
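
The point about the hidden constant can be made concrete with a generic form of the model. The block below is a sketch in commonly used notation, not a reproduction of Equations 1 to 3; the starred symbols and the rescaling by the constant a are assumptions of this illustration.

```latex
P_{ij} = \frac{\exp(\theta_i - \beta_j)}{1 + \exp(\theta_i - \beta_j)}
\qquad\text{and, with } \theta_i = a\,\theta_i^{*},\ \beta_j = a\,\beta_j^{*},\qquad
P_{ij} = \frac{\exp\{a(\theta_i^{*} - \beta_j^{*})\}}
              {1 + \exp\{a(\theta_i^{*} - \beta_j^{*})\}} .
```

Both forms give identical response probabilities, while the variance of the starred abilities is the variance of the unstarred abilities divided by the square of a. The common discrimination therefore cannot be separated from the spread of the ability distribution unless one of the two is fixed.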

The  Estimation  Procedure 

The entire estimation procedure is performed in two-step cycles.
Estimation of abilities using the current item parameter estimates is the first step,
and estimation of the item parameters using the current ability estimates is the
second. In each cycle the mean and variance of the ability continuum are
standardized to 0 and 1, respectively. The cycling continues until stable item
parameters are reached.

Estimation  of  Ability 

Estimation of abilities by maximum likelihood in this procedure, when
specialized to binary choice data, is accomplished by the standard method as
follows:

Let $j = 1, \ldots, n$ items;
$r_{ij} = 1$ if person $i$ is correct on item $j$, and $r_{ij} = 0$ if person $i$ is incorrect on item $j$;
$\theta_i$ = the latent ability of person $i$;
$\beta_j$ = the item difficulty parameters; and
$\alpha_j$ = the item discrimination parameters.

Then, for a given person $i$ the likelihood function is

$$L_i(\theta_i \mid r_i) = \prod_{j=1}^{n} P_{ij}^{\,r_{ij}} \, (1 - P_{ij})^{1 - r_{ij}}$$

where $P_{ij} = \Pr(r_{ij} = 1 \mid \theta_i)$; here

Therefore, the log likelihood is given by

Given the first and second derivatives of the log likelihood,
Newton-Raphson iteration may be applied to the kth-stage estimate of the parameter to yield
the (k + 1)st-stage estimate:


[7] 
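
A minimal sketch of this ability-estimation step follows. It assumes the response function has the common 2-parameter logistic form P_ij = 1 / (1 + exp(-α_j(θ_i - β_j))), which may differ in parameterization from the form used in LOGOG; the function names `derivatives` and `estimate_ability` are hypothetical.

```python
import math


def derivatives(theta, responses, alphas, betas):
    """First and second derivatives of the log likelihood with respect to theta."""
    first = 0.0
    second = 0.0
    for r, a, b in zip(responses, alphas, betas):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        first += a * (r - p)
        second -= a * a * p * (1.0 - p)  # always negative for this model
    return first, second


def estimate_ability(responses, alphas, betas, theta=0.0, tol=1e-8, max_iter=50):
    """Newton-Raphson ML estimate of one person's ability and its standard error.

    All-correct or all-incorrect response patterns have no finite maximum
    and are not handled by this sketch.
    """
    for _ in range(max_iter):
        first, second = derivatives(theta, responses, alphas, betas)
        step = first / second
        theta -= step  # (k + 1)st estimate = kth estimate - first / second
        if abs(step) < tol:
            break
    _, second = derivatives(theta, responses, alphas, betas)
    # Asymptotic variance: negative inverse of the second derivative.
    return theta, math.sqrt(-1.0 / second)


# Two items placed symmetrically about 0, one right and one wrong.
theta_hat, se_hat = estimate_ability([1, 0], [1.0, 1.0], [-1.0, 1.0], theta=0.5)
```

With one correct and one incorrect answer on items placed symmetrically about zero, the estimate converges to zero, as the symmetry of the likelihood requires.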


Estimation  of  Item  Parameters 


Estimation of the item parameters (difficulty, $\beta_j$, and discrimination,
$\alpha_j$) is not accomplished in the standard procedure (as described, for example, in
Lord, 1968). Instead, the item parameters are estimated under the assumption
that the abilities follow a previously specified distribution; here the normal
distribution is used, with a mean and variance of 0 and 1, respectively. This
is accomplished at each cycle by taking the current estimates of abilities,
ranking them, and distributing them into 10 groups or fractiles in such a way
that the number of subjects, $N_i$, across the $i = 1, \ldots, 10$ fractiles reflects the
normal distribution. Then, it is assumed that within each fractile the subjects
are sufficiently homogeneous to permit proceeding as though there are $N_i$
independent observations all at the same ability level, $\theta_i$, some middle value in
fractile $i$. Formally, the procedure is as follows:


Let $i = 1, 2, \ldots, g$ groups or fractiles whose subjects are
sufficiently homogeneous as to be characterized by $\theta_i$;
$N_i$ = number of subjects in fractile $i$ (determined by
the assumed distribution);
$r_{ij}$ = number of subjects in group $i$ who respond to
item $j$ correctly;
$r_j = \{r_{ij}\}$ = vector of item responses to item
$j$ across the $g$ fractiles;
$\beta_j$ = the required difficulty parameter; and
$\alpha_j$ = the required slope parameter.

Then, for a given item $j$ the likelihood function is

Given the matrices of first and second derivatives, Newton-Raphson iteration
may be applied to the kth-stage estimate of the parameters to yield the
(k + 1)st-stage estimates:

The Data


The  choice  of  the  distribution  of  abilities  for  the  monte  carlo  data  was 
made  to  approximate  an  available  set  of  data.  The  following  distribution 
function  was  used: 


$F(\theta)$   [11]

with corresponding density function

$f(\theta)$   [12]

The theoretical median, mean, variance, and coefficient of skewness were,
respectively,

Md = Median = -1.03555   [13]

$E[(\theta - \mu)^2]$ = 1.3888 . . .

+4.4921.

and the density is approximated by Figure 1.

Figure 1

The Distribution of θ


The  abilities  ranged  from  -2.5  to  +2.5,  with  75%  of  the  population  lying  between 
-2.5  and  0. 


A  sample  of  N  =  480  abilities  was  generated  by  obtaining  a  random  number  in 
the  unit  interval  for  the  value  of  the  probability  F  and  applying  the  inverse  of 
the  distribution  function: 

i  .  i  -vi  +  «»/«  -  n  ,  [i7] 


For each simulated subject, responses to n = 45 items were generated. The
difficulties of these items were set at values between -2.2 and +2.2 in steps of
.1; the discriminations were all set equal to 1. Each subject's responses to
these 45 items were generated by calculating the probability of a correct
response, $P_{ij}$, using these item parameters and the subject's θ, and then comparing
to a random number ρ in the unit interval. For each item

$$r_{ij} = \begin{cases} 1 & \text{if } P_{ij} > \rho \\ 0 & \text{if } P_{ij} < \rho \end{cases}$$
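
The data generation can be sketched as follows. The distribution function used here is an inference, not a quotation: a linearly decreasing density on [-2.5, 2.5], F(θ) = 1 - (2.5 - θ)² / 25, is consistent with the stated range, with 75% of the population lying below 0, with the reported median of -1.03555, and with the reported variance of 1.3888..., so it is assumed for illustration. The logistic form without a scaling constant and the function names are likewise assumptions.

```python
import math
import random

LOW, HIGH = -2.5, 2.5


def cdf(theta):
    """Assumed distribution function (linearly decreasing density on [LOW, HIGH])."""
    return 1.0 - (HIGH - theta) ** 2 / (HIGH - LOW) ** 2


def inverse_cdf(p):
    """Ability whose cumulative probability is p (inverse of cdf)."""
    return HIGH - (HIGH - LOW) * math.sqrt(1.0 - p)


def simulate(n_subjects=480, seed=0):
    """Return the 45 item difficulties and a list of (theta, responses) pairs."""
    rng = random.Random(seed)
    betas = [round(-2.2 + 0.1 * j, 1) for j in range(45)]  # -2.2, -2.1, ..., 2.2
    data = []
    for _ in range(n_subjects):
        theta = inverse_cdf(rng.random())
        row = []
        for beta in betas:
            p = 1.0 / (1.0 + math.exp(-(theta - beta)))  # discrimination = 1
            row.append(1 if p > rng.random() else 0)     # compare with rho
        data.append((theta, row))
    return betas, data


betas, data = simulate()
```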
The criterion used to determine successful estimation of the sample's
ability parameters was as follows: Construct approximate 95% confidence intervals
around each subject's estimated ability using the estimate of the asymptotic
variance of $\theta_i$ given by the negative of the inverse of the second derivative of
the log likelihood function. Then, simply count the number of subjects whose
95% confidence interval failed to cover the true ability and compare this number
to the expected number from a binomial distribution with p = .05.

Results 


The  results  of  the  monte  carlo  study  are  as  follows: 


1.  Estimating  the  abilities  using  the  above  procedure  and  placing  a  95% 
confidence  interval  around  each  estimated  ability  yielded  353  out  of 
the  480  simulated  subjects  for  which  the  95%  confidence  interval 
failed  to  cover  the  true  ability. 

2.  The mean of the estimated item difficulties was $\bar{b} = 0.898$; the mean of
the estimated item discriminations was $\bar{a} = 1.274$.

3.  Applying the linear transformation

$$\theta_i^{*} = \bar{b} + \bar{a}\,\hat{\theta}_i \qquad [19]$$

and the appropriate adjustments to the variance of the $\hat{\theta}$'s, yielding a
standard error of $\sigma_{\theta^{*}} = \bar{a}\,\sigma_{\hat{\theta}}$, and then placing 95% confidence intervals
around the transformed ability estimates, $\theta^{*}$, yielded 31 out of 480
subjects for which the 95% confidence interval around the transformed
ability estimate failed to cover the true ability, a result which did
not differ significantly (p > .09) from the expected number of 24 out
of 480 subjects. In other words, with this transformation procedure,
successful recovery of abilities was obtained.
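
The transformation and the coverage check can be sketched as below. The constants are the means reported in result 2; the 1.96 multiplier for the approximate 95% interval and the use of an exact binomial upper tail are assumptions of this sketch, since the source does not say how its significance level was computed. The function names are hypothetical.

```python
from math import comb

A_BAR, B_BAR = 1.274, 0.898  # means of estimated discriminations and difficulties


def transform(theta_hat, se_hat):
    """Apply theta* = b_bar + a_bar * theta_hat and rescale the standard error."""
    return B_BAR + A_BAR * theta_hat, A_BAR * se_hat


def covers(theta_true, theta_est, se, z=1.96):
    """True if the approximate 95% interval around theta_est contains theta_true."""
    return abs(theta_true - theta_est) <= z * se


def upper_tail(k, n=480, p=0.05):
    """Exact binomial probability of k or more misses out of n."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k, n + 1))


expected_misses = 480 * 0.05  # 24 subjects expected to be missed
tail_31 = upper_tail(31)      # chance of 31 or more misses under p = .05
```

Counting the subjects for whom `covers` is false, before and after `transform`, gives the two miss counts compared in results 1 and 3.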


Discussion 


The results of this study should be neither over-interpreted nor
under-interpreted. Although the study was based on only one sample of monte carlo
data, the random number sequence utilized to generate these data was thoroughly
checked for serial correlation and uniform distribution, utilizing the
procedures presented in Hammersley and Handscomb (1964, chap. 3). Since the latent
continuum was standardized to a mean of 0 and variance of 1 at each cycle, the
shift in the mean of the difficulties as well as the shift in the mean of the
discriminations cannot be interpreted simply as resulting from a failure to
standardize the latent continuum.

Nevertheless, these are monte carlo results which are only at best loosely
supported by theory. In addition, the behavior of this procedure in other
circumstances is unknown; that is, if the distribution of abilities or the
distribution of either of the item parameters is changed, the adequacy of the procedure for
recovering ability is undemonstrated. Consequently, extreme caution is
recommended before utilizing the correction presented here.
The study does support the contention that there is an intimate connection
between item parameter and ability parameter estimation. Although almost all
estimation procedures in latent trait theory utilize the conditional two-step
procedure (estimation of ability parameters followed by estimation of item
parameters), estimation of the two sets of parameters is not independent.
Consequently, latent trait methods that attempt to use a particular procedure in
estimation but that begin by assuming, for example, that the item parameters are
known and then present a "solution" to a particular problem for ability
estimation, given known item parameters, are likely to be of limited practical
utility.


REFERENCES 

Hammersley, J. M., & Handscomb, D. C. Monte Carlo methods. Norwich, Great
Britain: Fletcher & Son, Ltd., 1964.

Kolakowski, D., & Bock, R. D. LOGOG: Maximum likelihood item analysis and test
scoring: logistic model for multiple item responses. Chicago: National
Educational Resources, 1973.

Lord, F. M. An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's
three-parameter logistic model. Educational and Psychological Measurement,
1968, 28, 989-1020.


Estimation  of  Parameters  in  the 
3-Parameter  Latent  Trait  Model 


Hariharan  Swaminathan  and  Janice  Gifford 
University  of  Massachusetts 


The  successful  application  of  latent  trait  theory  to  practical  measurement 
problems  hinges  upon  the  availability  of  procedures  for  the  estimation  of  the 
parameters.  Hence,  investigations  of  the  adequacy  of  the  available  procedures 
for  estimating  parameters  in  latent  trait  models  are  necessary  and,  indeed,  play 
a  crucial  role  when  assessing  the  usefulness  of  latent  trait  theory. 

Although  the  problem  of  estimating  parameters  in  the  1-parameter  latent 
trait  model  appears  to  be  solved,  some  degree  of  controversy  seems  to  surround 
the  estimation  of  parameters  in  the  2-  and  3-parameter  models  (Andersen,  1973; 
Wright,  1977).  Lord  (1975)  has  empirically  evaluated  the  maximum  likelihood 
procedure  for  estimating  the  parameters  in  the  3-parameter  model  and  has  pro¬ 
vided  answers  to  some  of  the  questions  that  arise  with  respect  to  estimation  of 
parameters.  Jensema  (1976)  has  compared  the  efficiency  of  a  heuristic  procedure 
suggested  by  Urry  (1974)  for  estimating  the  parameters  in  the  3-parameter  model 
with the maximum likelihood procedure. Ree (1979) has compared the properties
of the Urry estimators and the maximum likelihood estimators and has
investigated the effect of violating the underlying assumptions on the
estimates; however, both the test length (80 items) and the number of examinees
were held fixed. Despite
these  efforts,  little  is  known  regarding  the  statistical  properties  of  the  esti¬ 
mators  in  the  3-parameter  model  and  the  effect  of  test  length  and  examinee  popu¬ 
lation  size  on  the  estimates. 

Purpose 

The  purpose  of  this  study  was  to  investigate  the  efficiency  of  the  Urry 
(1976)  procedure  and  the  maximum  likelihood  procedure  for  estimating  parameters 
in the 3-parameter model, to study the properties of the estimators, and to
provide some guidelines regarding the conditions under which they should be
employed. In particular, the issues investigated were (1) the "accuracy" of the
two  estimation  procedures,  (2)  the  relationship  between  the  number  of  items, 
examinees,  and  the  accuracy  of  estimation,  (3)  the  effect  of  the  distribution  of 
ability  on  the  estimates  of  item  and  ability  parameters,  and  (4)  the  statistical 
properties,  such  as  bias  and  consistency,  of  the  estimators. 

Design  of  the  Study 

In order to investigate the issues mentioned above, artificial data were
generated according to the 3-parameter logistic model

$$P_{ij}(\theta) = c_i + (1 - c_i)\{1 + \exp[-1.7\,a_i(\theta_j - b_i)]\}^{-1} \qquad [1]$$


using  the  DATGEN  program  of  Hambleton  and  Rovinelli  (1973).  Data  were  generated 
to  simulate  various  testing  situations  by  varying  the  test  length,  the  number  of 
examinees,  and  the  ability  distribution  of  the  examinees.  Test  lengths  were 
fixed  at  10  items,  15  items,  20  items,  and  80  items.  Since  the  accuracy  of  max¬ 
imum  likelihood  estimation  with  large  numbers  of  items  has  been  sufficiently 
documented by Lord (1975), tests with small numbers of items (10, 15, and 20)
were chosen so that the accuracy of the estimation procedure could be
ascertained for short tests. This is particularly important if latent trait theory
is  to  be  applied  to  criterion-referenced  measurement.  Similarly,  the  sizes  of 
examinee  population  were  set  at  50,  200,  and  1,000  in  order  to  study  the  effect 
of  small  sample  size  on  the  accuracy  of  estimation. 
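
The data-generation step described above can be sketched as follows. This is only an illustration of Equation 1, not the DATGEN program itself, whose internals are not described here; the function names `p_correct` and `simulate_responses`, the seed, and the three example items are assumptions made for the sketch.

```python
import math
import random

D = 1.7  # scaling constant that appears in Equation 1


def p_correct(theta, a, b, c):
    """Probability of a correct response under the 3-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(thetas, items, rng):
    """Return a 0/1 response matrix: one row per examinee, one column per item."""
    return [
        [1 if rng.random() < p_correct(theta, a, b, c) else 0 for (a, b, c) in items]
        for theta in thetas
    ]


rng = random.Random(1979)  # arbitrary seed, used only for reproducibility
# Hypothetical (a, b, c) triples; the study sampled a and b at random.
items = [(1.0, -1.0, 0.25), (1.5, 0.0, 0.25), (0.8, 1.0, 0.25)]
thetas = [rng.gauss(0.0, 1.0) for _ in range(200)]
responses = simulate_responses(thetas, items, rng)
```

When ability equals item difficulty, the logistic term is one half, so the probability of success is c + (1 - c)/2; with c = .25 this gives .625.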

In  the  Urry  (1976)  estimation  procedure,  the  relationships  that  exist  for 
item  discrimination  and  item  difficulty  between  the  latent  trait  theory  parame¬ 
ters  and  the  classical  item  parameters  are  exploited  (Lord  &  Novick,  1968,  pp. 
376-378).  These  relationships  are  derived  under  the  assumption  that  ability  is 
normally  distributed  and  that  the  item  characteristic  curve  (ICC)  is  the  normal 
ogive.  In  order  to  study  how  the  departures  from  the  assumption  of  normally 
distributed  abilities  affect  the  Urry  procedure,  three  ability  distributions 
were  considered:  normal,  uniform,  and  a  negatively  skewed  distribution.  The 
normal  and  uniform  distributions  were  generated  with  mean  0.0  and  variance  of 
1.0.  (The  uniform  distribution  was  generated  on  the  interval  -1.73  to  1.73  to 
ensure  unit  variance.)  A  beta  distribution  with  parameters  5  and  1.5  was  gener¬ 
ated  to  simulate  a  negatively  skewed  distribution,  and  then  rescaled  so  that  the 
mean  was  0.0  and  the  variance  1.0.  The  distributions  were  standardized  to  re¬ 
move  the  effect  of  scaling  on  the  estimates  of  the  parameters. 
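
A minimal sketch of the three ability distributions is given below. It assumes that standardization is done on the generated sample itself (subtracting the sample mean and dividing by the sample standard deviation); the text does not say whether sample or theoretical moments were used, and the names `standardize` and `draw_abilities` are hypothetical.

```python
import math
import random
import statistics


def standardize(values):
    """Rescale a sample to mean 0 and variance 1 (sample moments, an assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def draw_abilities(kind, n, rng):
    """Draw n standardized abilities from one of the three distributions."""
    if kind == "normal":
        raw = [rng.gauss(0.0, 1.0) for _ in range(n)]
    elif kind == "uniform":
        half_width = math.sqrt(3.0)  # about 1.73, which gives unit variance
        raw = [rng.uniform(-half_width, half_width) for _ in range(n)]
    elif kind == "skewed":
        raw = [rng.betavariate(5.0, 1.5) for _ in range(n)]  # negatively skewed
    else:
        raise ValueError(f"unknown distribution: {kind}")
    return standardize(raw)


abilities = draw_abilities("skewed", 1000, random.Random(1979))
```

The half-width of 1.73 follows from the variance of a uniform distribution on (-h, h), which is h^2/3; setting this to 1 gives h equal to the square root of 3.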

The  three  factors — test  length  (4  levels),  examinee  population  size  (3  lev¬ 
els),  and  ability  distribution  (3  levels) — were  completely  crossed  to  simulate 
36  testing  situations.  Test  data  arising  from  these  situations  were  subjected 
to the Urry estimation procedure using the computer program ANCILLES and to the
maximum likelihood estimation procedure using the computer program LOGIST (Wood,
Wingersky, & Lord, 1978).
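
The fully crossed design can be enumerated directly. The sketch below only lists the conditions; the constant names are illustrative, and the levels are those stated in the text.

```python
from itertools import product

TEST_LENGTHS = (10, 15, 20, 80)
SAMPLE_SIZES = (50, 200, 1000)
DISTRIBUTIONS = ("normal", "uniform", "skewed")

# Every combination of test length, sample size, and ability distribution.
conditions = list(product(TEST_LENGTHS, SAMPLE_SIZES, DISTRIBUTIONS))
print(len(conditions))  # 4 * 3 * 3 = 36
```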

Lord  (1975)  has  emphasized  the  fact  that  simulated  data  should  in  some  way 
resemble  real  data;  otherwise,  results  obtained  through  simulation  studies  will 
not  generalize  to  real  situations.  An  attempt  was  therefore  made  to  generate 
test data as realistically as possible. In order to accomplish this, item
difficulty parameters were sampled from a uniform distribution defined in the
interval b = -2.0 to 2.0, and item discrimination parameters were sampled from a
uniform distribution in the interval a = .6 to 2.0. Since data were generated
to simulate item responses to multiple-choice items with four choices, the
pseudo-chance level parameters were set at c = .25. It should be noted, however,
that this does not ensure close approximation of the generated data to real
data.  Combinations  of  item  difficulty  and  discrimination  that  may  not  occur  in 
constructed tests may occur with simulated tests and, hence, may affect the
estimation procedures, limiting the generalizability of the findings in simulated
studies  to  real  situations.  On  the  other  hand,  since  the  purpose  of  this  study 
was  to  compare  two  estimation  procedures  and  to  study  the  statistical  properties 
of  estimators,  the  possible  lack  of  correspondence  between  simulated  and  real 
data  may  not  be  a  serious  problem. 
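
The item-parameter sampling scheme just described can be sketched as below. Because a and b are drawn independently, combinations such as a very easy but highly discriminating item can arise, which is the concern raised above. The function name `draw_items` is hypothetical.

```python
import random


def draw_items(n_items, rng):
    """Sample (a, b, c) triples: a ~ U(.6, 2.0), b ~ U(-2.0, 2.0), c fixed at .25."""
    return [
        (rng.uniform(0.6, 2.0), rng.uniform(-2.0, 2.0), 0.25)
        for _ in range(n_items)
    ]


items = draw_items(20, random.Random(1979))  # arbitrary seed for reproducibility
```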


Results 


Accuracy  of  Estimation 


Comparisons  between  ANCILLES  and  LOGIST  across  various  test  lengths,  exam¬ 
inee  population  sizes,  and  ability  distributions  are  indicated  in  Tables  1,  2, 
and 3. The statistics reported are (1) the mean, $\mu$, of the population item
parameters for each population size; (2) the mean, $\bar{X}$, of the estimated
item parameters; and (3) the correlation, $\rho$, between the true parameters and
their estimates. These statistics are reported for the estimates obtained by employing
both  ANCILLES  and  LOGIST. 

A comparison of the mean of the generated item parameters, $\mu$, and the mean
of the estimates, $\bar{X}$, for each of the item parameters (discrimination, a;
difficulty, b; and pseudo-chance level, c) and for the ability parameter
($\theta$) provides some indication of the accuracy of estimation. However, this
comparison is
rather  weak  when  carried  out  alone,  since  the  means  do  not  contain  all  the  es¬ 
sential  information.  Simultaneous  comparisons  of  the  means  and  examination  of 
the  correlations  between  the  parameters  and  estimates,  on  the  other  hand,  pro¬ 
vide  more  complete  information  regarding  the  accuracy  of  estimation.  If  the 
correlation  is  high,  and  the  means  differ,  then  it  can  be  concluded  that  the 
estimation  was  not  sufficiently  accurate. 
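
The point that a high correlation can coexist with inaccurate estimates is easy to illustrate. In the sketch below the "estimates" are made-up values equal to the true values plus a constant .3, so the correlation is perfect although every estimate is too high; the helper names are assumptions.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def accuracy_summary(true_values, estimates):
    """The three statistics reported per condition: true mean, estimate mean, correlation."""
    return {
        "mu": mean(true_values),
        "x_bar": mean(estimates),
        "rho": pearson(true_values, estimates),
    }


true_a = [0.6, 1.0, 1.4, 2.0]
est_a = [t + 0.3 for t in true_a]  # hypothetical, uniformly inflated estimates
summary = accuracy_summary(true_a, est_a)
```

Here the correlation is 1 while the mean of the estimates exceeds the mean of the true values by .3, which is why the means and the correlations have to be read together.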

Lord (1975) has implied that if heteroscedasticity exists, it may not be
meaningful to compute correlations between true and estimated values, and, in
general, the authors of this paper agree. Strictly speaking, however,
heteroscedasticity also invalidates the computation of a least squares
regression line (the more appropriate criterion is generalized least squares)
and hence would rule out the use of simple, interpretable statistics for the
evaluation of the accuracy of estimation. Heteroscedasticity, when it occurred,
was therefore ignored, and correlations and least squares regression equations
were computed.

Estimation  of  the  discrimination  parameter.  Examination  of  the  results  in 
Tables 1, 2, and 3 indicates that the a parameter was poorly estimated for short
tests.  The  highest  correlation  between  true  values  and  estimates  for  a  test 
with  10  items  and  normally  distributed  ability  was  .36,  with  the  mean  of  the 
estimates  exceeding  the  mean  of  the  true  values.  The  correlations  improved  with 
increasing  sample  size  and  test  length,  with  the  mean  of  the  estimated  values 
approaching  the  mean  of  the  true  values  from  above.  The  highest  correlation 
between  the  estimated  and  true  values  was  .88  for  an  80-item  test  with  1,000 
examinees.  This  trend  was  also  evident  for  the  uniform  and  negatively  skewed 
distributions  of  ability.  In  general,  the  a  parameter  was  poorly  estimated  by 
ANCILLES,  with  the  estimation  improving  more  rapidly  with  increasing  test  length 
than  with  increasing  examinee  population  size. 

The least squares regression lines (for normally distributed ability) for
predicting the estimates from true values, given in Table 4, were plotted (not
shown) and compared with the line y = x in order to determine the extent of the
bias in estimation. The regression lines for all the test-length and
sample-size combinations fell above the line y = x, indicating that ANCILLES
systematically overestimated the a parameter, with the regression lines
approaching the line y = x with increasing test length. Again, the convergence
to the line y = x was more rapid with increasing test length than with
increasing sample size.
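
The comparison of a fitted regression line with the line y = x can be sketched as follows. The true values and estimates are invented for illustration and are not taken from Table 4; `fit_line` and `lies_above_identity` are hypothetical helper names.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def lies_above_identity(slope, intercept, lo, hi):
    """True if the fitted line is above y = x at both ends of [lo, hi].

    Both lines are straight, so being above at the two endpoints means
    being above everywhere in between.
    """
    return intercept + slope * lo > lo and intercept + slope * hi > hi


true_a = [0.6, 1.0, 1.5, 2.0]  # hypothetical true discriminations
est_a = [0.9, 1.3, 1.7, 2.1]   # hypothetical estimates, all too high
slope, intercept = fit_line(true_a, est_a)
overestimated = lies_above_identity(slope, intercept, 0.6, 2.0)
```

A line that lies above y = x over the whole range of true values corresponds to systematic overestimation; a slope near 1.0 with an intercept near 0.0 corresponds to little or no bias.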

Trends similar to those observed with ANCILLES were also observed with LOGIST.
Although the estimation of a was poor, the LOGIST estimates were consistently
better than those from ANCILLES in that the correlations between true values
and estimates were higher and the means of the estimates were much closer to
the  means  of  the  true  values.  Comparison  of  the  plots  of  the  regression  lines, 
given  in  Table  4,  with  the  line  y  =  x  showed  that  although  there  was  a  general 
tendency  for  the  parameters  to  be  overestimated,  this  tendency  was  not  as  marked 
as  with  ANCILLES;  the  convergence  of  the  regression  lines  to  the  line  y  =  x  was 
more  rapid.  These  trends — the  higher  correlations  between  true  and  estimated 
values  than  for  ANCILLES  estimates,  the  tendency  for  the  means  of  the  estimates 
to  be  closer  to  the  means  of  the  true  values,  and  the  rapidity  of  convergence  of 
the  regression  line  to  the  line  y  =  x— were  also  observed  with  the  uniform  and 
negatively  skewed  distribution  of  ability. 

Estimation of the difficulty parameter. ANCILLES was very successful in
providing accurate estimates of the b parameter. The correlations between
estimates and true values ranged from .85 to .99. Comparison of the regression
lines for normally distributed ability, given in Table 4, with the line y = x
indicated that the b parameter was generally overestimated for tests with 15
and 20 items, though not for tests with 10 items. With larger numbers of
items, there was a tendency for difficult items to be overestimated and for easy
items to be underestimated. However, the bias was slight in that the
convergence of the regression line to the line y = x was rapid with increasing
numbers of items and sample size.

In  general,  the  LOGIST  estimates  of  the  b  parameters  were  better  than  the 
estimates  produced  by  ANCILLES.  The  correlations  between  true  and  estimated 
values  ranged  from  .88  to  1.00,  whereas  ANCILLES  yielded  correlations  ranging 
from  .85  to  .99.  The  means  of  the  estimates  were,  in  general,  closer  to  the 
means of the true values than they were with ANCILLES. Comparisons of the
regression lines, given in Table 4, with the line y = x revealed that with
increasing test length and sample size, the regression line approached the line
y = x rather rapidly, demonstrating that there was no bias in the estimation. No
clear  trends  were  visible  with  10,  15,  and  20  items,  although  the  test  with  10 
items  and  50  examinees  produced  overestimates  of  the  b  parameter.  These  results 
appeared  to  hold  for  both  the  uniform  and  negatively  skewed  distributions  of 
ability,  although  with  the  skewed  distribution  there  were  two  instances  when  the 
estimates  of  difficulty  went  out  of  bounds.  These  cases  are  indicated  with  an 
asterisk  in  Table  2.  However,  with  80  items  and  1,000  examinees,  the  agreement 
between  estimated  values  and  true  values  was  comparable  to  that  obtained  with 
normally  distributed  ability. 

In general, the b parameter was estimated rather well by both LOGIST and
ANCILLES. LOGIST fared surprisingly well with small numbers of items and
examinees in comparison with ANCILLES, and in general produced better estimates
(as determined by the correlations) than did ANCILLES.

Chance-level parameter. The true value of the chance-level parameter was
set at c = .25 for all the items. Given this lack of variation among the true
values, correlations between estimates and true values were not computed.
Hence, only the mean of the true values, the mean of the estimates, and the
standard deviation of the estimates are reported in Tables 1, 2, and 3.

ANCILLES clearly produced very poor estimates of the c parameter. The
means  of  the  estimates  were  consistently  higher  than  the  mean  of  the  true  val¬ 
ues,  with  relatively  large  standard  deviations.  LOGIST  estimates,  on  the  other 
hand,  were  close  to  the  true  values,  with  small  standard  deviations.  The  mean 
LOGIST  estimates  ranged  from  .12  to  .25  for  normally  distributed  ability,  from 
.19  to  .25  for  skewed  distribution  of  ability,  and  from  .18  to  .25  for  uniformly 
distributed  ability.  In  comparison,  ANCILLES  yielded  estimates  that  ranged  from 
.20  to  .36,  .20  to  .56,  and  .22  to  .46,  respectively,  for  the  three  distribu¬ 
tions  of  ability. 

Estimation of ability. An examination of Tables 1, 2, and 3 indicates a
consistent pattern in the estimation of ability ($\theta$) for both LOGIST and
ANCILLES. The correlations between true values and estimates did not seem to be
affected by increasing sample sizes for fixed test lengths. On the other hand,
increasing the length of the test greatly increased the agreement between true
values and estimates. This unsurprising trend held for the three distributions
of $\theta$.

In  general,  it  appears  that  although  no  differences  existed  between  the 
ANCILLES and LOGIST estimates of $\theta$ for tests with 15 items or more, the LOGIST
estimates  fared  better  than  the  ANCILLES  estimates  for  short  tests  with  10 
items.  This  effect  was  more  pronounced  with  the  skewed  ability  distribution. 

A closer examination of the two sets of estimates, made by comparing the
regression lines (obtained by regressing the estimates on the true values) with
the line y = x, indicated that, in general, ANCILLES underestimated $\theta$ for
examinees with high true abilities and overestimated $\theta$ for examinees with
low true abilities. This may partly be attributed to the fact that the c
parameters were overestimated. No such trends were evident with the LOGIST
estimates. These regression lines rapidly converged to the line y = x with
increasing test length.

Effect of Ability Distribution

A $\chi^2$ test was used to determine if the uniform and the beta distributions
deviated sufficiently from the normal. The beta distribution yielded a $\chi^2$
value of 63.5 when the tails of the normal distribution were excluded and a
value of 193.1 when the tails were included. The uniform distribution yielded a
$\chi^2$ value of 69.6 when tails were excluded and 307.7 when the tails were
included. This indicates that both distributions deviated sufficiently from the
normal, with the uniform distribution deviating even more than the beta
distribution.
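
One way such a test could be set up is sketched below. The text does not give the class intervals or the definition of the "tails", so the cut points here (half-unit cells from -2.0 to 2.0, with everything outside treated as the two tails) are assumptions, as is the function name `chi_square_vs_normal`. The statistic is the usual Pearson sum of (observed - expected)^2 / expected over cells, with expected counts taken from the standard normal distribution.

```python
import math
import random
from statistics import NormalDist


def chi_square_vs_normal(sample, edges, include_tails):
    """Pearson chi-square of binned counts against N(0, 1) expected counts."""
    std_normal = NormalDist()
    n = len(sample)
    bounds = list(edges)
    if include_tails:
        # Add one open-ended cell below the first edge and one above the last.
        bounds = [-math.inf] + bounds + [math.inf]
    stat = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        observed = sum(1 for v in sample if lo <= v < hi)
        expected = n * (std_normal.cdf(hi) - std_normal.cdf(lo))
        stat += (observed - expected) ** 2 / expected
    return stat


rng = random.Random(1979)  # arbitrary seed
half_width = math.sqrt(3.0)
uniform_sample = [rng.uniform(-half_width, half_width) for _ in range(1000)]
edges = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]  # assumed cut points
without_tails = chi_square_vs_normal(uniform_sample, edges, include_tails=False)
with_tails = chi_square_vs_normal(uniform_sample, edges, include_tails=True)
```

Because the uniform distribution places no observations beyond 1.73 in either direction, the tail cells contribute their full expected count to the statistic, which is consistent with the much larger value reported when the tails are included.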


Comparisons of the results in Tables 1, 2, and 3 reveal that, in general,
the beta distribution affected both estimation procedures, while the uniform
distribution produced results similar to those obtained using a normal ability
distribution. Although the beta distribution affected the estimation of a for
both procedures and of c and $\theta$ for ANCILLES, the estimation of b did not
seem to be affected in either case. ANCILLES fared poorly with the skewed
distribution in comparison to LOGIST in the estimation of the a, c, and
$\theta$ parameters.

The estimates for the a parameter, resulting from both procedures, were
negatively  correlated  with  the  true  values  for  short  tests.  For  longer  tests, 
although  estimates  from  both  procedures  improved,  ANCILLES  produced  poor  esti¬ 
mates  in  comparison  to  LOGIST.  For  an  80-item  test  with  1,000  examinees,  a  cor¬ 
relation  of  .68  was  obtained  using  ANCILLES,  as  compared  to  a  correlation  of  .82 
obtained  from  LOGIST. 

The estimates of the c parameters resulting from ANCILLES were extremely
high  for  all  tests  except  those  of  80  items.  The  mean  values  ranged  from  .20  to 
.56  with  the  beta  distribution,  as  compared  to  a  range  of  .20  to  .36  for  the 
normal  distribution  of  ability.  The  LOGIST  estimates,  on  the  other  hand,  were 
underestimated  but  comparable  to  those  obtained  using  a  normal  distribution  of 
ability. 

The  LOGIST  estimates  of  ability  resulting  from  using  a  skewed  distribution 
of  ability  were  as  good  as,  and  in  some  cases  better  than,  the  estimates  ob¬ 
tained  with  a  normal  distribution.  In  contrast,  ANCILLES  with  a  skewed  distri¬ 
bution  resulted  in  poorer  estimates.  This  effect  held  true  even  as  sample  size 
and  test  length  increased. 

Thus, ANCILLES estimates of the $\theta$, a, and c parameters seemed to be
affected more dramatically than the LOGIST estimates when ability had a skewed
distribution. It should be noted that although the uniform distribution had a
larger $\chi^2$ value than the beta distribution, the results obtained with the
uniform distribution of ability were similar to those obtained with the normal
distribution.

It  is,  then,  not  departures  from  normality  but  departures  from  symmetry  and  the 
unavailability  of  examinees  in  the  lower  tail  of  the  ability  distribution  that 
affected  the  estimation  procedure. 

Statistical  Properties  of  Estimation 


Bias. If $\hat{\gamma}$ is an estimator of $\gamma$, then $\hat{\gamma}$ is an
unbiased estimator of $\gamma$ if

$$E(\hat{\gamma}) = \gamma, \qquad [2]$$

where E(·) is the expectation operator. This is a desirable property of
estimators.


Schmidt (1977) has pointed out that the Urry procedure, developed by Urry
in 1974, systematically overestimated the a parameter and underestimated the b
parameter. Urry (1976) suggested a correction for this and incorporated it
into the ANCILLES program, which was employed to estimate parameters in this
study. Since it appears that the estimates are unbiased for large numbers of
items and examinees (Lord, 1975), a relatively short test of 20 items with 200
examinees was selected in order to study the effect of this correction on the
estimates and to examine whether the LOGIST estimates were unbiased. Response
data were generated and item parameters were estimated; this was replicated 20
times. Since the replications were obtained by generating sets of random
examinees, the bias in the estimator of ability was not investigated.

The results of the replications are presented in Table 5, in which the true
value, $\mu$, of each of the 20 item parameters is given together with the mean
estimate, $\bar{X}$, of the item parameter over 20 replications. The standard
error and the t value, obtained as

$$t = (\bar{X} - \mu)/SE, \qquad [3]$$

are also given to indicate the degree of departure of the mean estimate from the
true value.
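
The t value of Equation 3 can be computed for a single item parameter as sketched below. The sketch assumes that SE is the standard error of the mean over replications (the standard deviation of the estimates divided by the square root of the number of replications); the text does not define SE further. The five replicate estimates are invented and `bias_t` is a hypothetical name.

```python
import math
import statistics


def bias_t(true_value, estimates):
    """Return (mean estimate, standard error of the mean, t) for one parameter."""
    x_bar = statistics.fmean(estimates)
    se = statistics.stdev(estimates) / math.sqrt(len(estimates))
    return x_bar, se, (x_bar - true_value) / se


# Hypothetical replicate estimates of a discrimination whose true value is 1.0.
x_bar, se, t = bias_t(1.0, [1.1, 1.3, 1.2, 1.4, 1.0])
```

For these made-up values the mean estimate is 1.2 and t is about 2.83, so the mean estimate sits well above the true value relative to its sampling variability.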

ANCILLES clearly overestimated the a parameter, as did LOGIST. However,
the bias in the LOGIST estimates did not appear to be as severe as the bias in
the ANCILLES estimates. This finding is borne out in Figure 1, where the
regression line for predicting $\bar{X}$ from $\mu$ is plotted for both ANCILLES
and LOGIST and compared with the line y = x. The LOGIST regression line is
closer to the line y = x and shows that small values of a were overestimated,
while very large values tended to be estimated accurately, partly due to the
fact that an upper limit was imposed on the estimates. On the other hand,
ANCILLES tended to overestimate large values of a even more than small values.

With item difficulty, LOGIST tended to underestimate easy items, while
producing relatively accurate estimates of very difficult items (Figure 2).
ANCILLES, on the other hand, tended to overestimate items with high b levels and
to underestimate items with negative b levels. In general, ANCILLES seemed to
produce biased estimates of b throughout the entire range.

Consistency. If $\hat{\gamma}_n$ is an estimator of $\gamma$, $\hat{\gamma}_n$
is a consistent estimator of $\gamma$ if for any positive $\epsilon$ and $\eta$
there is some N such that

$$\text{Prob}\{|\hat{\gamma}_n - \gamma| < \epsilon\} > 1 - \eta, \quad n > N. \qquad [4]$$

Consistency is a desirable property in that it ensures that an estimator tends
to a definite quantity, which is the true value to be estimated.

The  problem  of  consistency  has  raised  several  questions  concerning  the  es¬ 
timation  of  parameters  in  the  latent  trait  models.  Andersen  (1972)  has  argued 
that  a  consistent  estimator  of  the  discrimination  parameter  does  not  exist  and, 
hence,  has  questioned  the  meaningfulness  of  the  2-  and  3-parameter  models. 

In order to investigate whether or not the LOGIST and ANCILLES estimators
were consistent, the regression equations for predicting the estimates from the
true values of the various parameters were examined. The definition of a
consistent estimator given earlier implies that an estimator is consistent if it
is asymptotically unbiased and its variance tends to 0.0 with increasing sample
size. Consequently, in order for the estimators of the latent trait parameters
to be consistent, (1) the slope of the regression equation must approach 1.0 and
the intercept must approach 0.0; and (2) the variance, and hence the standard
errors of the estimates of the slope and intercept, must approach 0.0. If these
conditions are met, then the estimator is consistent.

Figure 1

Bias in the Estimation of the Discrimination Parameter
of the 3-Parameter Logistic Model

The regression coefficients and the standard errors are reported in Table
4. The results indicate that when both the number of items and the number of
examinees increase, the slope and intercept coefficients approach 1.0 and 0.0,
respectively, with the standard errors approaching 0.0. This tendency is
evident for both ANCILLES and LOGIST estimators for the a, b, and c parameters,
and for $\theta$. In all these cases, the LOGIST estimator converged in
probability to the true value more rapidly than the ANCILLES estimator. It
should be pointed out, however, that the results reported here do not
conclusively support this. It is clearly necessary to examine the standard
errors and the regression coefficients with a greater number of items and
examinees.
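
The step from the definition of consistency to the two conditions examined here (vanishing bias and vanishing variance) is not spelled out in the text. One standard argument, offered as a sketch rather than as the authors' own derivation, writes $\hat{\gamma}_n$ for the estimator of $\gamma$ from a sample of size n and applies Chebyshev's inequality to the mean squared error:

```latex
\Pr\{|\hat{\gamma}_n - \gamma| \ge \epsilon\}
  \;\le\; \frac{E(\hat{\gamma}_n - \gamma)^2}{\epsilon^2}
  \;=\; \frac{\operatorname{Var}(\hat{\gamma}_n) + [E(\hat{\gamma}_n) - \gamma]^2}{\epsilon^2}
```

If both the variance and the bias tend to 0.0 as n increases, the right-hand side can be made smaller than any positive $\eta$ for all n beyond some N, which is the condition required of a consistent estimator. The conditions are sufficient rather than necessary.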

Figure 2

Bias in the Estimation of the Difficulty
Parameter of the 3-Parameter Logistic Model


Discussion


The purpose of this study was to compare two methods for estimation of
parameters in the 3-parameter logistic model: the Urry method of estimation and
the maximum likelihood procedure. The computer programs that were used were the
ANCILLES program and the LOGIST program (Wood, Wingersky, & Lord, 1978). The
efficiency of the procedures was compared with respect to the accuracy of
estimation, the effect of violating underlying assumptions (for ANCILLES), and
the statistical properties of the estimators. The factors that were controlled
were test length (4 levels), examinee population size (3 levels), and ability
distribution (3 levels).

The  results  indicate  that,  in  general,  the  maximum  likelihood  procedure  was 
superior  to  the  Urry  procedure  with  respect  to  the  estimation  of  all  item  and 
ability  parameters.  The  differences  were  pronounced  in  the  estimation  of  the 
discrimination  and  chance-level  parameters,  but  with  respect  to  the  estimation 
of  ability  and  difficulty  parameters,  the  differences  were  less  remarkable. 
Differing $\theta$ distributions had little effect on the estimation of b and
$\theta$. However, with a skewed distribution of $\theta$, ANCILLES produced
poorer estimates of the a and c parameters than with normal or uniform $\theta$
distributions. LOGIST, although
faring  better  than  ANCILLES  (with  the  exception  of  the  10-item  test),  produced 
slightly  poorer  results  with  the  skewed  distribution  than  with  the  normal  or 
uniform  distribution. 

The number of examinees had a slight effect in improving the accuracy of
estimation of the b and c parameters and $\theta$. However, increasing the
number of items and the number of examinees considerably improved the accuracy
of the a estimates with both procedures. Surprisingly enough, a 20-item test
with 1,000 examinees produced excellent estimates of the b and c parameters and
reasonably good estimates of a and $\theta$. Tests with 80 items and 1,000
people fared considerably better, providing good estimates of all parameters.
Tests with 15 items or fewer, while yielding good estimates of the b and c
parameters and reasonable estimates of $\theta$, yielded poor estimates of the a
parameter. This severely limits the application of the 3-parameter latent trait
model to criterion-referenced measurement situations, since criterion-referenced
tests typically have fewer than 10 items. However, it should be pointed out
that this limitation exists only if the item parameters and ability parameters
are estimated simultaneously. If item banks with known item characteristics are
employed to estimate ability, or if the 1-parameter model is employed, this
limitation may not exist.

Although the LOGIST estimates were superior to the ANCILLES estimates,
especially in the case of short tests, the difference between them was
negligible when the number of items and the number of examinees increased. This
is of particular importance, since ANCILLES requires considerably less computer
time than LOGIST. The computer time taken by LOGIST, especially with large
numbers of items and examinees, may become forbidding enough to warrant the use
of ANCILLES in this situation. It should be noted that, in fairness to the
maximum likelihood procedure, the Urry procedure, in general, deletes more items
and examinees during estimation than does the maximum likelihood procedure.
This may explain the rapidity of convergence and indicate a weakness in
ANCILLES.

The bias and consistency results indicate that for small numbers of items, the
estimates of the item and ability parameters are biased, with the ANCILLES
estimates more biased than the LOGIST estimates. As the number of examinees and
the number of items increase, it appears that the estimators are unbiased and,
in fact, are consistent. This, in a sense, supports a conjecture of Lord (1968)
and shows that the 3-parameter model may be statistically viable.


Small N Justifies Rasch Methods


Frederic  M.  Lord 
Educational  Testing  Service 


The usual Birnbaum item response function requires determining three parameters
for each item; the Rasch model requires only one. If there is only a small group
of examinees, the $a$ parameter (the discriminating power) cannot be determined
accurately for some of the items. The $c$ parameters are even more of a problem.
For small samples, is it perhaps better to use the Rasch model, estimating only
one parameter per item, even though the Rasch model is incorrect?

For a better perception of the problem, consider a common prediction problem not
related to item response theory: Suppose it is desired to predict variable $y$
from measurements on five predictors. An available sample has been used to
estimate the linear regression of $y$ on the predictors. This regression
equation may now be applied to estimate $y$ for new individuals drawn from the
same population. If the sample used to estimate the regression equation was
large, the procedure is a good one; but if this sample was small, the procedure
may be worse than simply using the sample mean of $y$ as the predicted value of
$y$ for each new individual. Suppose, for example, that the true multiple
correlation for predicting $y$ was .40. If the sample had only 60 cases, the
predictions from the sample regression equation would typically be no more
accurate than a prediction that each new value of $y$ will fall at the sample
mean of $y$.

It  would  be  useful  to  know  how  large  the  sample  of  examinees  must  be  before 
it  is  worthwhile  to  use  a  2-  or  3-parameter  item  response  model  in  preference  to 
the  Rasch  model.  The  answer  to  this  question  will,  of  course,  depend  on  the 
purpose  to  be  served.  The  present  paper  is  a  modest  beginning:  it  only  answers 
this  question  for  the  2-parameter  logistic  model  and  only  for  one  very  limited 
situation.  The  purpose  of  this  paper,  then,  is  to  point  out  the  problem,  to 
indicate  a  method  of  solution,  and  to  provide  some  numerical  results,  indicating 
the  sample  size  required  when  there  is  no  guessing. 

Method 


Under the Rasch model, ability must be estimated by some function of the
number-correct score $x$, since this is a sufficient statistic under this model.
Under the 2-parameter logistic model, ability must be estimated by a function of
the weighted sum $\sum_{i=1}^{n} a_i u_i$ of item responses ($u_i$), the weight
for each item being the item discriminating power ($a_i$); under this model,
this weighted sum is a sufficient statistic for estimating ability.

Given the $a_i$, the information function for number-correct score $x$ and the
information function for the weighted sum $\sum_i a_i u_i$ can be readily
calculated and compared. The weighted sum always provides more information than
the number-correct score except in the limiting case where the two scores are
identical or proportional. In practice, the number-correct score perhaps
provides up to 95% as much information as the weighted sum.
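
The comparison just described can be sketched numerically. The sketch below is not the author's program; it assumes the usual score information formula for a linear score, $(d\,\mathcal{E}[\text{score}]/d\theta)^2 / \operatorname{Var}(\text{score})$, the scaling constant $D = 1.7$ given in the Appendix, and the Test 3 item parameters as printed in Table 1. The function names are illustrative only.

```python
import math

D = 1.7  # logistic scaling constant stated in the Appendix

# (a_i, b_i) for Test 3 (Items 3, 8, ..., 48), copied from Table 1
TEST3 = [(1.6, -1.9), (1.3, -1.7), (1.4, -1.2), (1.2, -1.0), (0.6, -1.3),
         (1.8, -0.9), (0.7, -0.8), (0.8, 0.0), (0.5, 0.8), (0.6, 0.8)]

def p_correct(theta, a, b):
    """2-parameter logistic item response function P_i(theta)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def score_information(theta, items, weights):
    """Information about theta carried by the linear score sum_i w_i u_i."""
    slope = 0.0     # derivative of the expected score with respect to theta
    variance = 0.0  # variance of the score at this theta
    for (a, b), w in zip(items, weights):
        p = p_correct(theta, a, b)
        slope += w * D * a * p * (1.0 - p)
        variance += w * w * p * (1.0 - p)
    return slope * slope / variance

def relative_efficiency(theta, items):
    """Information in number-correct x relative to the a-weighted sum."""
    unit_weights = [1.0] * len(items)
    a_weights = [a for a, _ in items]
    return (score_information(theta, items, unit_weights)
            / score_information(theta, items, a_weights))

for theta in (-2, -1, 0, 1, 2):
    print(theta, round(relative_efficiency(theta, TEST3), 3))
```

Under these assumptions the ratio never exceeds 1, and at $\theta = 0$ it comes out near .85, which agrees with the efficiency read from the last two columns of Table 5.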


But now suppose that the $a_i$ are not known but are only estimated. If the
estimates $\hat a_i$ are sufficiently inaccurate, the weighted sum
$\sum_i \hat a_i u_i$ will be less informative than the number-correct score
$x$. The problem is to make a precise statement showing how the usefulness of
the weighted sum $\sum_i \hat a_i u_i$ depends on the number of cases used to
determine the estimated weights $\hat a_i$.


It is desired to compare $x \equiv \sum_i u_i$ and $\sum_i \hat a_i u_i$ as
estimators of ability. Note, however, that taking expectations over the $u_i$
for fixed $\theta$ gives

$$\mathcal{E}x = \sum_i P_i(\theta), \qquad [1]$$

and

$$\mathcal{E}\sum_i \hat a_i u_i = \sum_i \hat a_i P_i(\theta), \qquad [2]$$

where $P_i(\theta)$ is the 2-parameter logistic item response function, the
probability of answering the item correctly. This result shows that each scoring
method provides an unbiased estimator of a function of ability, but the
functions estimated are not the same.


A comparison must be made between a function of $x$ and a function of
$\sum_i \hat a_i u_i$ that estimate the same ability parameter. Moreover, the
function of $x$ should be independent of the $a_i$, since the estimation of
$a_i$ is not a part of any Rasch procedure. This will be done as follows:

The ability parameter to be estimated will be considered to be the individual's
true number-correct score:

$$\zeta \equiv \sum_i P_i(\theta). \qquad [3]$$

Note that this is simply a specified monotonic transformation of ability
$\theta$. Since $x$ is an unbiased estimator of $\zeta$, $x$ is clearly the
ideal statistic under the Rasch model for this purpose.

The optimal estimator of $\zeta$ under the 2-parameter model is not $x$. If
there is no prior distribution for $\zeta$, an optimal estimator is the function
of the sufficient statistic $\sum_i a_i u_i$ that is unbiased for $\zeta$. This
function is uniquely determined by the Rao-Blackwell theorem (Kendall & Stuart,
1973, sec. 17.35), but it is too complicated for practical use here.

If the $a_i$ and the item difficulties $b_i$ are known, an asymptotically
optimal estimator of $\zeta$ under the 2-parameter model is the maximum
likelihood estimator (MLE),

$$\hat\zeta \equiv \sum_i P_i(\hat\theta), \qquad [4]$$

where $\hat\theta$ is the MLE of $\theta$ when the $a_i$ and $b_i$ are known.
This follows from the fact that the MLE of a given function of a parameter is,
under regularity conditions, the same function of the MLE of the parameter.
Moreover, this estimator $\sum_i P_i(\hat\theta)$ is actually a function of the
weighted sum $\sum_i a_i u_i$, since the MLE is always a function of the
sufficient statistic if such exists. In fact, $\hat\theta$ is the solution of
the likelihood equation

$$\sum_i a_i P_i(\hat\theta) - \sum_i a_i u_i = 0. \qquad [5]$$

Since the $a_i$ and $b_i$ are not known, let $\hat a_i$ and $\hat b_i$,
estimated from some previously available sample of $N$ examinees, be
substituted. Thus, the 2-parameter estimator to be compared with $x$ is

$$\hat{\hat\zeta} \equiv \sum_{i=1}^{n} \hat P_i(\hat{\hat\theta}), \qquad [6]$$

where $\hat P_i(\theta)$ is the item response function, with $\hat a_i$ and
$\hat b_i$ substituted for the unknown true item parameters, and
$\hat{\hat\theta}$ is the solution of

$$\sum_i \hat a_i \hat P_i(\hat{\hat\theta}) - \sum_i \hat a_i u_i = 0. \qquad [7]$$

If $N$ is large enough, $\hat{\hat\zeta}$ will necessarily show the same
advantage over $x$ as does the weighted sum $\sum_i a_i u_i$ when the $a_i$ are
known. But what if $N$ is small, so that the $\hat a_i$ and $\hat b_i$ are
erroneous estimates?
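
A minimal sketch of how the estimator defined above might be computed for one response pattern is given here. It is an illustration, not the author's procedure: bisection is an assumed choice of root finder, $D = 1.7$ is taken from the Appendix, the function names are hypothetical, and the item parameters are those of Test 1A as printed in Table 1. Passing true parameters corresponds to Equations 4 and 5; passing estimated parameters corresponds to Equations 6 and 7.

```python
import math

D = 1.7  # logistic scaling constant stated in the Appendix

def p_correct(theta, a, b):
    """2-parameter logistic P_i(theta), written to avoid overflow."""
    z = D * a * (theta - b)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def solve_theta(items, responses, tol=1e-10):
    """Solve sum_i a_i P_i(theta) - sum_i a_i u_i = 0 by bisection."""
    if all(u == 1 for u in responses):
        return math.inf    # a perfect score has no finite solution
    if all(u == 0 for u in responses):
        return -math.inf   # a zero score has no finite solution
    target = sum(a * u for (a, _), u in zip(items, responses))

    def g(theta):
        return sum(a * p_correct(theta, a, b) for a, b in items) - target

    lo, hi = -4.0, 4.0
    while g(lo) > 0:       # widen the bracket until it contains the root
        lo *= 2.0
    while g(hi) < 0:
        hi *= 2.0
    while hi - lo > tol:   # g is increasing in theta
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def true_score_estimate(items, responses):
    """Plug the solved theta into sum_i P_i(theta)."""
    theta_hat = solve_theta(items, responses)
    return sum(p_correct(theta_hat, a, b) for a, b in items)

# Test 1A (Items 10, 20, 30, 40, 50), values copied from Table 1
TEST_1A = [(1.1, -1.3), (0.6, -0.4), (0.5, 0.3), (0.8, 0.1), (0.7, 0.9)]
print(true_score_estimate(TEST_1A, (1, 1, 0, 1, 0)))
```

In this sketch, all-correct and all-wrong patterns are assigned the boundary true scores $n$ and 0, since the likelihood equation then has no finite root; whether the original computations treated them the same way is not stated in the text.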

Since $x$ and $\hat{\hat\zeta}$ are both consistent estimators of the same
ability parameter ($\zeta$), they are properly compared by their mean squared
errors (MSE). The exact sampling variance of $x$ is

$$\operatorname{Var}(x) = \sum_{i=1}^{n} P_i(\theta)\,Q_i(\theta), \qquad [8]$$

where $Q_i(\theta) \equiv 1 - P_i(\theta)$. Since $x$ is unbiased for $\zeta$,
this is also the exact MSE.
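
Equations 3 and 8 are simple enough to check directly. The following sketch, offered as an illustration rather than the author's code, evaluates both for Test 3 using the parameter values printed in Table 1 and the constant $D = 1.7$ from the Appendix; the function names are hypothetical.

```python
import math

D = 1.7  # logistic scaling constant stated in the Appendix

# (a_i, b_i) for Test 3 (Items 3, 8, ..., 48), copied from Table 1
TEST3 = [(1.6, -1.9), (1.3, -1.7), (1.4, -1.2), (1.2, -1.0), (0.6, -1.3),
         (1.8, -0.9), (0.7, -0.8), (0.8, 0.0), (0.5, 0.8), (0.6, 0.8)]

def p_correct(theta, a, b):
    """2-parameter logistic item response function P_i(theta)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, items):
    """Equation 3: zeta = sum_i P_i(theta)."""
    return sum(p_correct(theta, a, b) for a, b in items)

def var_number_correct(theta, items):
    """Equation 8: Var(x) = sum_i P_i(theta) Q_i(theta)."""
    total = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        total += p * (1.0 - p)
    return total

for theta in (-2, -1, 0, 1, 2):
    print(theta, round(true_score(theta, TEST3), 2),
          round(var_number_correct(theta, TEST3), 2))
```

At $\theta = 0$ this gives a true score of about 7.4 and a variance of about 1.29, close to the tabled 7.4 (Table 2) and 1.30 (Table 5); small differences are to be expected because Table 1 lists rounded parameters.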

The sampling variance of $\hat{\hat\zeta}$ depends on $\zeta$, the true score of
the examinee whose true score is to be estimated. Given $\zeta$, the variance of
$\hat{\hat\zeta}$ arises from two sources:

1. Sampling fluctuations in the data on $N$ examinees used to estimate the
   $a_i$ and the $b_i$,

2. Sampling fluctuations in the responses ($u_i$) of examinees at the given
   true-score level $\zeta$.

The examinee whose true score is to be estimated is not included in the sample
of $N$ examinees; thus, the second source of error is independent of the first.

It does not seem feasible to obtain the exact MSE of $\hat{\hat\zeta}$ when
$N < \infty$; consequently, the present study deals only with its asymptotic
variance, which is equal to its asymptotic MSE. Formulas for calculating the
asymptotic sampling variance are given in the Appendix.


Table 1

Item Serial Numbers and Item Parameters for All Tests Studied

Item            Item Parameters
Serial No.        a        b
    3            1.6     -1.9
    4            1.7     -1.5
    5            0.8     -1.7
    8            1.3     -1.7
    9            0.4      0.5
   10            1.1     -1.3
   13            1.4     -1.2
   14            0.9     -1.1
   15            0.6     -1.9
   18            1.2     -1.0
   19            1.6     -0.9
   20            0.6     -0.4
   23            0.6     -1.3
   24            0.5     -1.4
   25            0.9     -0.9
   28            1.8     -0.9
   29            0.9     -0.8
   30            0.5      0.3
   33            0.7     -0.8
   34            0.7     -0.4
   35            1.0     -0.2
   38            0.8      0.0
   39            1.1     -0.4
   40            0.8      0.1
   43            0.5      0.8
   44            1.1     -0.3
   45            0.7      0.3
   48            0.6      0.8
   49            0.4      0.3
   50            0.7      0.9


Test Studied


Numerical results can only be obtained for particular numerical values of the
item parameters $a_i$ and $b_i$. The following procedure was used in the hope of
obtaining realistic numerical values.

The responses of 3,000 6th-grade students to a 50-item Metropolitan (MAT)
vocabulary test were analyzed by LOGIST. Since (for simplicity) the present
study was limited to the 2-parameter model, all $c$ parameters were held at 0.
The $a_i$ and $b_i$ obtained were used as true item parameters for the tests to
be studied here. These item parameters are listed in Table 1.

Table 2 shows how 4 different 10-item tests are defined in terms of the items
listed in Table 1. Tests 3, 4, and 5 are nonoverlapping spaced samples of items.
Since the items in Table 1 are arranged roughly in order of difficulty, in
Table 2 test difficulty tends to increase from top to bottom. Table 2 also shows
for each test the true test score $\zeta$ that corresponds to specified values
of $\theta$. Remember that for any given test, $\zeta$ and $\theta$ are
equivalent measures of the same ability, differing only in scale.

Table 2

True Score ($\zeta$) Equivalent to Specified Ability Levels ($\theta$)
for Four 10-Item Tests

                                         Specified Values of $\theta$
Test   Items in Test                    -2      -1       0       1       2
3      3, 8, 13, 18,..., 48            1.8     4.8     7.4     8.7     9.4
4      4, 9, 14, 19,..., 49            1.5     4.1     7.1     8.7     9.3
5      5, 10, 15, 20,..., 50           1.7     3.8     6.3     8.2     9.3
1B     10, 10, 20, 20,..., 50, 50      1.2     3.1     5.4     7.4     8.8

Results 


Number-correct score $x$ is an unbiased estimator of $\zeta$. On the other hand,
$\hat\zeta$ is only asymptotically unbiased. The exact small-sample bias of
$\hat\zeta$ was calculated for 10-item Test 1B and for 5-item Test 1A, parallel
to Test 1B except for length. (The method used for computing
$\mathcal{E}(\hat\zeta - \zeta \mid \zeta)$ is entirely parallel to the method
for computing $\operatorname{Var}(\hat\zeta \mid \zeta)$ described in the
Appendix.) Test 1B consisted of 2 items exactly like Item 10, 2 like Item 20,
and so forth, for a total of 10 items. Test 1A consisted simply of Items 10, 20,
30, 40, and 50.

Table 3 compares the bias of these two tests that differed only in length. The
bias was small, even for five-item tests. The true score $\zeta$ of Test 1B was
exactly double the true score of Test 1A, but the bias in $\hat\zeta$ increased
more modestly, if at all, as the test length was doubled.

Table 3

Bias ($\mathcal{E}\hat\zeta - \zeta$) in True Score Estimate $\hat\zeta$ for
Tests 1A and 1B, Which Were Parallel Except for Length, When Item Parameters
Were Known ($N = \infty$)

                     Test
$\theta$      1A (n=5)     1B (n=10)
  -2           -.028         -.035
  -1            .029          .037
   0            .045          .047
   1            .029          .026
   2            .020          .020

Table 4 shows the exact small-sample variance of $\hat\zeta$ when the item
parameters were known (that is, determined from an infinitely large sample of
examinees). In this table Tests 1A, 1B, and 1C, which were parallel except for
length, are compared. Test 1C contained 3 items like Item 10, 3 like Item 20,
and so forth, for a total of 15 items. As might be expected, the sampling
variance increased almost exactly in proportion to test length.


Table 4

Variance (Equation A2) of True Score Estimate $\hat\zeta$ When Item Parameters
Were Known ($N = \infty$) for Tests 1A, 1B, 1C, Which Were Parallel Except for
Length

                           Test
$\theta$      1A (n=5)     1B (n=10)     1C (n=15)
  -2            .52          1.01          1.50
  -1            .92          1.75          2.58
   0            .93          1.87          2.82
   1            .79          1.60          2.40
   2            .47           .95          1.44

As noted previously, the optimal estimator of $\zeta$ is the function of
$\sum_i a_i u_i$ that is unbiased for $\zeta$. Since the desired function (which
can be found by the Rao-Blackwell theorem) is impractical to use, the consistent
estimator $\hat\zeta$ was used. The MSE is equal to the variance plus the square
of the bias. For Tests 1A and 1B, it can be seen from Tables 3 and 4 that the
MSE differed from the variance of $\hat\zeta$ only in the third decimal place.
Table 5 compares the variance of $x$ with the variance of $\hat{\hat\zeta}$.
Replacing variance by MSE would not change the picture. The relative efficiency
of two consistent estimators is asymptotically proportional to the ratio of
their sampling variances. A comparison of the last 2 columns of Table 5 shows
that the efficiency of $x$ ranged from .85 at $\theta = 0$ and $\theta = 1$ to
.93 at $\theta = -2$.

Table 5

Variance (Equation A2) of True Score Estimate $\hat{\hat\zeta}$ for Test 3, as a
Function of the Sample Size ($N$) Used to Estimate the Item Parameters; Also
Variance (Equation 8) of Number-Correct Score $x$

                        Var($\hat{\hat\zeta}$ | $\zeta$) when
$\theta$   $\zeta$    N=100    N=300    N=1,000   N=$\infty$    Var(x | $\zeta$)
  -2        1.8       1.32     1.23      1.20       1.19            1.28
  -1        4.8       1.84     1.76      1.73       1.72            1.90
   0        7.4       1.18     1.13      1.12       1.11            1.30
   1        8.7        .78      .74       .73        .72             .85
   2        9.4        .47      .45       .44        .44             .50

Interpolating in Table 5, it can be seen that for $\theta = -2$, $x$ was better
than $\hat{\hat\zeta}$ when the item parameters were estimated from a sample
with $N < 200$, to a rough approximation; $\hat{\hat\zeta}$ was better than $x$
when $N > 200$. It can be said, therefore, that $N = 200$ is the critical sample
size. For the other tabled $\theta$ values, the critical sample size is in each
case less than 100.

The critical $N$'s for Test 3 are listed in Table 6 along with similar values
for Tests 4, 5, 1A, 1B, and 1C. Because of the heavy cost in computer time, no
runs were made for 15-item tests other than Test 1C. It appears that for the
10- and 15-item tests, the Rasch estimator $x$ may be slightly superior to the
2-parameter estimator $\hat{\hat\zeta}$ when the number of cases available for
estimating the item parameters is less than 100 or 200. This is the main
conclusion of the study.
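
The interpolation described above can be sketched as follows. The text does not say which interpolation rule was used; straight-line interpolation between adjacent tabled sample sizes is assumed here, and the function name is hypothetical. The variances are the Test 3 entries of Table 5.

```python
def critical_n(n_values, var_estimator, var_x):
    """Sample size at which the 2-parameter estimator's variance falls to
    Var(x), by straight-line interpolation between tabled sample sizes.
    Returns None when there is no crossing inside the tabled range."""
    pairs = list(zip(n_values, var_estimator))
    for (n0, v0), (n1, v1) in zip(pairs, pairs[1:]):
        if v0 >= var_x >= v1 and v0 != v1:
            return n0 + (v0 - var_x) / (v0 - v1) * (n1 - n0)
    return None

N_VALUES = [100, 300, 1000]

# theta = -2 row of Table 5: variances at N = 100, 300, 1,000; Var(x) = 1.28
print(critical_n(N_VALUES, [1.32, 1.23, 1.20], 1.28))

# theta = 0 row of Table 5: the estimator already beats x at N = 100
print(critical_n(N_VALUES, [1.18, 1.13, 1.12], 1.30))
```

For $\theta = -2$ the straight-line rule gives a crossing near $N = 189$, consistent with the rounded value of 200; for $\theta = 0$ no crossing occurs within the table, consistent with a critical sample size below 100.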


Table 6

Approximate Number of Cases ($N$) Required for $\hat{\hat\zeta}$ To Have a
Smaller Sampling Variance Than Number-Correct Score $x$

                               Test
$\theta$      1A       3       4       5      1B      1C
  -2         700     200    <100     300     250     200
  -1        3000    <100    <100     250     200     100
   0        <100    <100    <100     200    <100    <100
   1         100    <100    <100     150     150     200
   2         100    <100    <100     100     250     250

Conclusions 


This study has been limited to a comparison of 1-parameter (Rasch) and
2-parameter estimators of the examinee's true score. Similar studies should be
made for the 3-parameter model. Estimators of other quantities, such as item
difficulty, should also be compared. The same approach can, in principle, be
applied to determine the relative effectiveness of the Rasch and other models
for test equating and other practical purposes; however, the computational
burden of doing this may prove to be excessive.


APPENDIX


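Asymptotic Sampling Variance of $\hat{\hat\zeta}$

By a standard formula from analysis of variance, the error variance of
$\hat{\hat\zeta}$ for fixed $\zeta$ can be written

$$\operatorname{Var}(\hat{\hat\zeta} \mid \zeta) = \mathcal{E}_u[\operatorname{Var}(\hat{\hat\zeta} \mid \zeta, \mathbf{u}) \mid \zeta] + \operatorname{Var}_u[\mathcal{E}(\hat{\hat\zeta} \mid \zeta, \mathbf{u}) \mid \zeta], \qquad [A1]$$

where $\mathcal{E}_u$ and $\operatorname{Var}_u$ are taken across all possible
response vectors $\mathbf{u} \equiv \{u_i\}$. To understand the terms in
brackets, note that when $\mathbf{u}$ is fixed, the only other source of
variability is sampling error in the estimation of the $a_i$ and the $b_i$.
Remember that the $\hat a_i$ and $\hat b_i$ are obtained from a sample of $N$
examinees and that $\mathbf{u}$ belongs to an examinee who is not part of that
sample.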

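For large $N$ the last term in Equation A1 is adequately approximated by
replacing the estimates $\hat a_i$ and $\hat b_i$ by their true values $a_i$ and
$b_i$; in other words, $\mathcal{E}(\hat{\hat\zeta} \mid \zeta, \mathbf{u})$ can
be replaced by $\mathcal{E}(\hat\zeta \mid \zeta, \mathbf{u})$. By Equations 4
and 5, fixing $\mathbf{u}$ also fixes $\hat\zeta$, so now
$\mathcal{E}(\hat\zeta \mid \zeta, \mathbf{u}) = \hat\zeta$. Thus, the last term
in Equation A1 becomes $\operatorname{Var}_u(\hat\zeta \mid \zeta)$. By
Equation 3 whenever $\zeta$ is fixed, $\theta$ is fixed also; so Equation A1 can
be written approximately

$$\operatorname{Var}(\hat{\hat\zeta} \mid \zeta) = \mathcal{E}_u[\operatorname{Var}(\hat{\hat\zeta} \mid \theta, \mathbf{u}) \mid \theta] + \operatorname{Var}_u(\hat\zeta \mid \theta). \qquad [A2]$$

The first variance on the right arises from sampling fluctuations in the
$\hat a_i$ and the $\hat b_i$; the second variance is independent of these
fluctuations. The second variance can be evaluated as follows: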

1. For each possible item response pattern $\mathbf{u}$, determine $\hat\theta$
   by solving Equation 5 numerically.

2. For each $\hat\theta$ from Step 1, compute
   $\hat\zeta = \sum_i P_i(\hat\theta)$.

3. For each $\mathbf{u}$, compute

   $$\operatorname{Prob}(\mathbf{u} \mid \theta) = \prod_{i=1}^{n} P_i(\theta)^{u_i}\, Q_i(\theta)^{1 - u_i} \qquad [A3]$$

   for the given values of $\theta$ (not $\hat\theta$). This gives the frequency
   distribution of $\mathbf{u}$.

4. Compute the variance for given $\theta$ of the $\hat\zeta$ obtained in
   Step 2, taken over the frequency distribution of $\mathbf{u}$ found in
   Step 3. The result is $\operatorname{Var}_u(\hat\zeta \mid \theta)$, as
   required.

5. Repeat the foregoing for different given values of $\theta$, as desired.
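
The five steps above can be sketched as a complete enumeration over the $2^n$ response patterns. This is an illustrative reading of the steps, not the author's program: bisection is an assumed solver for Equation 5, all-correct and all-wrong patterns are assumed to map to $\hat\zeta = n$ and $\hat\zeta = 0$, $D = 1.7$ is taken from later in the Appendix, and the item parameters are those of Test 1A as printed in Table 1.

```python
import itertools
import math

D = 1.7  # logistic scaling constant stated later in the Appendix
TEST_1A = [(1.1, -1.3), (0.6, -0.4), (0.5, 0.3), (0.8, 0.1), (0.7, 0.9)]

def p_correct(theta, a, b):
    """2-parameter logistic P_i(theta), written to avoid overflow."""
    z = D * a * (theta - b)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def solve_theta(items, u, tol=1e-10):
    """Step 1: solve Equation 5 for one response pattern by bisection."""
    if all(ui == 1 for ui in u):
        return math.inf
    if all(ui == 0 for ui in u):
        return -math.inf
    target = sum(a * ui for (a, _), ui in zip(items, u))

    def g(theta):
        return sum(a * p_correct(theta, a, b) for a, b in items) - target

    lo, hi = -4.0, 4.0
    while g(lo) > 0:
        lo *= 2.0
    while g(hi) < 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def exact_moments(theta, items):
    """Steps 1-4: return (bias, variance, total probability) at this theta."""
    probs = [p_correct(theta, a, b) for a, b in items]
    total = mean = second = 0.0
    for u in itertools.product((0, 1), repeat=len(items)):
        theta_hat = solve_theta(items, u)                             # Step 1
        zeta_hat = sum(p_correct(theta_hat, a, b) for a, b in items)  # Step 2
        prob = math.prod(p if ui else 1.0 - p
                         for p, ui in zip(probs, u))                  # Step 3
        total += prob
        mean += prob * zeta_hat
        second += prob * zeta_hat ** 2
    variance = second - mean ** 2                                     # Step 4
    return mean - sum(probs), variance, total

for theta in (-2, -1, 0, 1, 2):                                       # Step 5
    print(theta, exact_moments(theta, TEST_1A))
```

The work grows as $2^n$ patterns per ability level, which illustrates why such calculations become impractical for tests much longer than 15 items.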

Although the notation does not make it explicit, the results obtained depend, of
course, on the $a_i$ and $b_i$ of the items in the test being studied. A
separate study must be carried out for each different test. Because of the
number of calculations required, practical considerations limit investigation to
tests not much longer than 15 items.

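It remains to evaluate the first term on the right of Equation A2. The quantity
$\operatorname{Var}(\hat{\hat\zeta} \mid \theta, \mathbf{u})$ will be evaluated
by the delta method. Let $\zeta'_{a_i}$ denote the partial derivative of
$\hat{\hat\zeta}$ with respect to $\hat a_i$, and similarly for $\hat b_i$ and
$\hat{\hat\theta}$. The total derivative of $\hat{\hat\zeta}$ is

$$d\hat{\hat\zeta} = \zeta'_\theta\, d\hat{\hat\theta} + \sum_i \zeta'_{a_i}\, d\hat a_i + \sum_i \zeta'_{b_i}\, d\hat b_i; \qquad [A4]$$

now, however, by Equation 7, $\hat{\hat\theta}$ is itself a function of the
$\hat a_i$ and the $\hat b_i$. Denoting the left side of Equation 7 by $l$, the
total derivative of Equation 7 is

$$l'_\theta\, d\hat{\hat\theta} + \sum_i l'_{a_i}\, d\hat a_i + \sum_i l'_{b_i}\, d\hat b_i = 0, \qquad [A5]$$

where $l'_{a_i}$ is the partial derivative of $l$ with respect to $\hat a_i$,
and similarly for $\hat b_i$ and $\hat{\hat\theta}$. Eliminating
$d\hat{\hat\theta}$ from Equations A4 and A5 gives

$$l'_\theta\, d\hat{\hat\zeta} = \sum_i (l'_\theta \zeta'_{a_i} - \zeta'_\theta l'_{a_i})\, d\hat a_i + \sum_i (l'_\theta \zeta'_{b_i} - \zeta'_\theta l'_{b_i})\, d\hat b_i. \qquad [A6]$$

When the delta method is applied to Equation A6, it is found that

$$\operatorname{Var}(\hat{\hat\zeta} \mid \theta, \mathbf{u}) = \frac{1}{(l'_\theta)^2} \sum_i \Big\{ (l'_\theta \zeta'_{a_i} - \zeta'_\theta l'_{a_i})^2 \operatorname{Var}(\hat a_i \mid \theta, \mathbf{u}) + (l'_\theta \zeta'_{b_i} - \zeta'_\theta l'_{b_i})^2 \operatorname{Var}(\hat b_i \mid \theta, \mathbf{u}) + 2 (l'_\theta \zeta'_{a_i} - \zeta'_\theta l'_{a_i})(l'_\theta \zeta'_{b_i} - \zeta'_\theta l'_{b_i}) \operatorname{Cov}(\hat a_i, \hat b_i \mid \theta, \mathbf{u}) \Big\}. \qquad [A7]$$

For given $\theta$ and $\mathbf{u}$, the variance needed for the first term on
the right of Equation A2 can be computed from Equation A7. The necessary
derivatives are

$$\zeta'_\theta = \sum_i a_i \pi_i, \qquad [A8]$$

$$\zeta'_{a_i} = (\theta - b_i)\, \pi_i, \qquad [A9]$$

$$\zeta'_{b_i} = -a_i \pi_i, \qquad [A10]$$

$$l'_\theta = \sum_i a_i^2 \pi_i, \qquad [A11]$$

$$l'_{a_i} = P_i - u_i + a_i (\theta - b_i)\, \pi_i, \qquad [A12]$$

$$l'_{b_i} = -a_i^2 \pi_i, \qquad [A13]$$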

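where $\pi_i \equiv D P_i Q_i$ denotes the derivative of $P_i$ with respect to
$L_i \equiv a_i(\theta - b_i)$, and $D = 1.7$. The variance-covariance matrix of
$\hat a_i$ and $\hat b_i$, needed in Equation A7, is found by inverting the
Fisher information matrix:

$$\begin{Vmatrix}
D^2 \sum_{a=1}^{N} (\theta_a - b_i)^2 P_{ia} Q_{ia} & -D^2 a_i \sum_{a=1}^{N} (\theta_a - b_i) P_{ia} Q_{ia} \\
-D^2 a_i \sum_{a=1}^{N} (\theta_a - b_i) P_{ia} Q_{ia} & D^2 a_i^2 \sum_{a=1}^{N} P_{ia} Q_{ia}
\end{Vmatrix},$$

where $P_{ia} \equiv P_i(\theta_a)$ and $Q_{ia} \equiv 1 - P_{ia}$.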
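
A minimal sketch of this inversion for a single item follows. It assumes that the abilities $\theta_a$ of the $N$ calibration examinees are treated as known, as the information matrix above implies; the evenly spaced ability values and the function name are hypothetical, chosen only for illustration.

```python
import math

D = 1.7  # logistic scaling constant stated in the Appendix

def p_correct(theta, a, b):
    """2-parameter logistic item response function P_i(theta)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def item_param_covariance(a, b, thetas):
    """Invert the 2x2 Fisher information matrix for one item's (a, b).
    Returns (Var(a_hat), Cov(a_hat, b_hat), Var(b_hat))."""
    i_aa = i_ab = i_bb = 0.0
    for theta in thetas:
        p = p_correct(theta, a, b)
        pq = p * (1.0 - p)
        i_aa += D * D * (theta - b) ** 2 * pq
        i_ab += -D * D * a * (theta - b) * pq
        i_bb += D * D * a * a * pq
    det = i_aa * i_bb - i_ab * i_ab
    # Inverse of [[i_aa, i_ab], [i_ab, i_bb]]
    return i_bb / det, -i_ab / det, i_aa / det

# Hypothetical calibration sample: 100 abilities evenly spaced on [-2, 2]
thetas = [-2.0 + 4.0 * k / 99 for k in range(100)]
print(item_param_covariance(1.0, 0.0, thetas))
```

Because every element of the information matrix is a sum over examinees, doubling the calibration sample halves each variance and covariance, which is the source of the dependence on $N$ seen in Table 5.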

To evaluate Equation A2 for fixed $\theta$, compute
$\operatorname{Var}(\hat{\hat\zeta} \mid \theta, \mathbf{u})$ by Equation A7
separately for each $\mathbf{u}$. Then, take the average of these values across
$\mathbf{u}$ (weighting each value by $\operatorname{Prob}(\mathbf{u} \mid
\theta)$ given by Equation A3) to find the first term on the right. Add on the
second term $\operatorname{Var}_u(\hat\zeta \mid \theta)$, computed as described
earlier. The resulting $\operatorname{Var}(\hat{\hat\zeta} \mid \zeta)$ must be
computed separately for each $\zeta$ or $\theta$ of interest. (Note again that
fixing $\zeta$ is equivalent to fixing $\theta$, because of Equation 3.)


Discussion:  Session  8 


Bert  F.  Green,  Jr. 

Johns  Hopkins  University 


The  papers  presented  in  this  session  seem  to  have  been  done  competently  and 
to  have  given  reasonable  results.  I  should  like,  however,  to  put  their  results 
in  some  perspective. 

Why is latent trait theory attractive? It promises to deliver a scale that is
essentially invariant over different item selections; therefore, the measurement
scale provided, the $\theta$ scale, is paramount. One feature of that scale is,
of course, that its zero point and unit are arbitrary. In isolated experiments
there must be some way of specifying the location and unit for $\theta$. The
usual procedure is to fix the mean at 0 and the standard deviation at 1. Waller
appears to claim that this is not enough: When the original $\theta$
distribution is badly skewed, there appears to be severe bias in the estimation
of the item parameters.

Part of the problem is readily solved with a scale adjustment. Waller's original
distribution of $\theta$ had a mean of -.83 and a standard deviation of 1.18.
The original values of the item parameters had average difficulty of 0 and
average discriminability of 1.0 on this scale. Yet, the LOGOG computer program
sets the mean of the ability distribution to 0 and the standard deviation to 1
and reports estimates of the item parameters on that scale. If Waller had
transformed the original item parameters to correspond with a standardized
$\theta$ scale, they would have had an average difficulty of .83 and an average
discriminability of 1.18 (assuming arithmetic averages). In fact, Waller
observed an average difficulty of .90 and an average discriminability of 1.27.
Thus, most of the difference seems to be artificial and to be due to a scale
shift.

How much of the remaining difference is bias and how much is sampling error?
Since only one sample of 480 pseudo-cases and 45 pseudo-items was tried, there
is no way to tell: One sample does not make a Monte Carlo study. Swaminathan and
Gifford did a similar study with a negatively skewed distribution (Waller's was
positively skewed) and obtained difficulties and discriminabilities that were
too high. This would seem to be consistent for the discriminabilities and to be
inconsistent for the difficulties.

If the $\theta$ scale is important, then its metric is important, in which case
why did Hambleton and Cook use rank-order correlations to evaluate
correspondence of $\theta$ and $\hat\theta$? Product-moment correlation would
seem to be the obvious choice. They claimed that the scale of $\theta$ is
arbitrary. The origin and unit are arbitrary, but the metric is not. If the
metric were arbitrary, of what value is latent trait theory?

Some investigators believe that the $\theta$ scale, at least the maximum
likelihood $\hat\theta$ scale, is unfit for linear statistical methods. If so,
then latent trait theorists have an inferior product. The problem is at the
extremes, where ability estimates can have huge standard errors. Lord advocates
transforming back to the true score scale, which I thought was what we were
escaping, whereas Novick advocates (I suppose) Bayesian estimation. Bayesian
estimates have no problem, because an infinite value of $\theta$ has an
infinitesimal a priori probability, so that only an infinitely perverse examinee
will give any trouble. There are other possibilities. Why not refuse to give a
score to an extreme person? Of course, an adaptive test with an adequate supply
of items would, at least in principle, be in a much better position. Since in
such a test the item difficulties match the person's ability, this "end effect"
should be a much smaller problem. One way or another, though, this end problem
needs to be resolved. There is no future for test scores that are unsuited to
linear statistical methods.


Hambleton and Cook and Swaminathan and Gifford have studied the properties of
estimates of ability and the item parameters as functions of the number of
examinees, the number of items, and the true distribution of ability. They used
"constructed" data and the Monte Carlo approach. Note carefully that each
tabular entry is based on only one sample data matrix. Although the entries are
averages over items and persons, in a real sense, each of the entries represents
one sample point. Thus, individual entries are not to be relied upon; only
general trends should be interpreted.

Hambleton and Cook evaluated the fit of the 1-, 2-, and 3-parameter models to
data from each of these three models with a uniform distribution of ability.
They also compared the lower and upper halves of the ability distributions. With
only 20 items, the 1- and 2-parameter models were poorer in the lower half than
in the upper half of the ability distribution. When the entire ability
distribution was analyzed, 40 items were slightly better than 20, and ability
was better estimated when there was no guessing. All three models fit a given
data matrix almost equally well, but apparently this is a very good set of
"items." (Roughly the same pattern of results was found for a normal
distribution of ability, but all values were smaller.)

Much more interesting are their results concerning the standard errors of
estimate as a function of ability. Clearly, a 10-item test gives unsatisfactory
standard errors; a 20-item test is not very good for low abilities; but an
80-item test gives nearly constant standard errors for abilities in the range
-2.0 to +2.0. The typical great increase of standard error occurs for more
extreme scores. (This is the unfortunate problem with the ability metric that
was mentioned above.)

I would like to know not only how well each method did relative to the true
values but also how the methods compared with each other. What are the
correlations of $\hat\theta$ with $\hat\theta$ for the 1-, 2-, and 3-parameter
models? Almost certainly, they were extremely high.

Swaminathan and Gifford compared two estimation procedures and found LOGIST to
be superior. They also showed that as both the number of items and the number of
persons increased jointly, the LOGIST parameter estimates approached the true
values without bias, indicating empirical consistency. This demonstration is
heartening but would be more convincing with more data sets, i.e., more
replications.

Swaminathan and Gifford showed that Urry's method had trouble estimating the
guessing parameter. It would be interesting to know if the other problems with
the method were related to this flaw. Why not estimate a single guessing
parameter for all items, or at least for all items of a given type? Or, if there
are few enough items, why not set $c$ = .20 or .25, or whatever seems
empirically reasonable, and only estimate the other two parameters for each
item?

Note  that  Hambleton  and  Cook  and  Swaminathan  and  Gifford  asked  how  large  N 
and  n  should  be.  By  contrast,  Lord  asked  which  procedure  should  be  used  if  N  is 
small.  He  reasoned,  and  found,  that  for  an  N  small  enough  (roughly  100)  the 
1-parameter  model  was  actually  superior.  Given  the  recent  work  on  equal  weights 
in  regression,  that  result  must  inevitably  be  so.  Empirically  determined 
weights  are  uncertain  with  small  N. 

Hambleton and Cook claimed that samples of 200 persons and 20 items are
satisfactory for some applications of latent trait theory. It is very important
that their conclusions be noted carefully and that that claim not be
over-generalized. Certainly, when the model fits the data, the item parameters
can be adequately estimated. The ability parameters can also be estimated, but
the standard errors are large, and the extreme cases are still a problem. A
standard 20-item test will not give very reliable results, no matter what theory
is used. And 80 items and a great many examinees would be very much better than
20 items and only 200 examinees.

How  realistic  are  these  studies?  First,  all  of  them  used  data  constructed 
by  monte  carlo  methods.  Lord  based  his  theoretical  item  parameters  on  those 
from  an  actual  data  set — a  30-item  subset  from  a  50-item  vocabulary  test.  The 
range  of  item  discrimination  indices  was  .4  to  1.8,  with  quartiles  of  .55,  .83, 
and  1.35.  Swaminathan  and  Gifford  used  a  uniform  distribution  in  the  range  .6 
to  2.0,  a  distinctly  better  set  of  items.  Hambleton  and  Cook  used  two  ranges — 
.5  to  1.74  and  .81  to  1.43 — much  like  Lord's  set.  All  of  these  are  good  items, 
with excellent discrimination. Also, the items were constructed to be
unidimensional. What happens with items of more ordinary discriminability and
with some secondary group factors?

Secondly,  how  often  will  the  model  be  applied  when  item  parameters  are  un¬ 
known?  Is  it  not  at  least  as  likely  that  calibrated  items  will  be  available 
from  which  only  ability  needs  to  be  estimated?  Suppose  a  few  uncalibrated  items 
are  being  pretested;  item  parameters  are  to  be  estimated  in  the  context  of  the 
calibrated items and the estimated ability scores. Most especially, how does
this kind of conditional estimation proceed in a computerized adaptive testing
environment? This seems a good place to apply sequential Bayesian procedures.
Finally,  what  happens  with  real  data?  Simulation  studies  have  their  place,  but 
much  more  is  to  be  learned  with  real  data. 

All too often, thinking and formation of concepts in behavioral science
have  been  misled  by  readily  available  statistical  methods,  especially  of  the 
multivariate  variety.  A  typical  example  of  how  theorizing  in  psychology  can  be 
led  astray  by  statistical  methods  is  the  obsolete  dispute  on  the  percentage  of 
genetically  versus  environmentally  determined  intelligence.  In  this  field  there 
has  been  an  unwavering  attempt  to  apply  models  of  variance  decomposition,  which 
had  been  developed  for  breeding  experiments  in  stock-farming  and  which  are  ap¬ 
propriate  for  that  purpose;  however,  these  methods  are  not  suited  for  yielding 
scientific  insights  into  the  genetic  and  environmental  factors  of  human  intelli¬ 
gence.  On  the  other  hand,  there  has  been  a  failure  to  develop  adequate  methods 
for  answering  the  question,  What  is  the  effect  of  specified  types  of  socioeco¬ 
nomic  environment  on  the  development  of  human  intelligence? 

However,  it  is  methodology  that  must  be  adjusted  to  the  theoretical  con¬ 
cepts  and  problems  in  applied  behavioral  science,  rather  than  the  reverse.  This 
is  illustrated  by  an  example  from  communication  research:  In  1971  a  basic  prob¬ 
lem  of  market  and  opinion  research  was  posed,  namely,  What  is  the  effect  of  an 
insertion  in  different  media,  such  as  television,  radio,  or  newspapers?  For  the 
practical  purpose  of  optimizing  a  campaign  with  a  limited  budget,  a  simple  an¬ 
swer  to  the  question  was  needed,  e.g.,  "An  insertion  in  television  is  three 
times  as  effective  as  a  comparable  insertion  in  a  local  radio  program."  In  addi¬ 
tion,  it  seemed  that  it  was  chiefly  the  methods  currently  used  in  communication 
research  that  were  responsible  for  the  lack  of  generalizable  results  on  communi¬ 
cation  effects.  The  problem  was  as  follows:  Suppose  it  were  possible  to  de¬ 
scribe  the  effectiveness  of  each  medium  by  just  one  quantitative  parameter;  sup¬ 
pose  further  that  each  interviewed  person  could  be  characterized  by  certain  at¬ 
titude  parameters  pertaining  to  the  topic  of  the  campaign  and  by  the  subject's 
individual  amount  of  consumption  of  each  medium.  What  kind  of  probabilistic 
model would then give a straightforward answer to the simple question, What is
the effect of medium j relative to the effect of medium k?

At  the  same  time,  for  theoretical  as  well  as  for  practical  reasons,  there 
was  an  attempt  to  comply  with  the  principle  of  specific  objectivity,  as  intro¬ 
duced  by  Rasch  (1967,  1972):  The  comparison  of  the  effect  parameters  of  two 
media  should  depend  on  these  two  parameters  only  and  should  be  independent  of 
any  irrelevant  factors,  such  as  the  parameters  characterizing  the  initial  atti¬ 
tudes  of  the  respondents.  In  other  words,  the  result  should  be  independent  of 
the sample of respondents.

These considerations resulted in a family of logistic models that are
closely  related  to  the  well-known  Rasch  (1960,  1966)  models  but  also  show  some 
marked  distinctions.  Unfortunately,  the  models  have  been  applied  to  assessing 
effects  of  mass  communication  only  once;  but  many  problems  in  clinical  and  edu¬ 
cational  psychology  are  of  similar  structure,  the  media  being  replaced  by  thera¬ 
peutic  or  educational  treatments.  A  considerable  number  of  applications  in 
these  fields  have  been  undertaken  in  the  last  five  years,  and  the  theoretical 
and  methodological  bases  of  the  models  have  been  further  strengthened. 

As  can  easily  be  seen,  the  question  regarding  the  effects  of  mass  communi¬ 
cation  is  nothing  but  a  special  case  of  the  question  of  change  under  the  influ¬ 
ence  of  some  sort  of  treatment.  Therefore,  the  models  referred  to  are  of  con¬ 
siderably  broad  interest.  Their  distinction  from  more  conventional  approaches 
to  measurement  of  change  is  that  the  data  are  regarded  as  what  the  observations, 
in fact, mostly are: qualitative variables. This paper makes no attempt to
scale or to quantify the data in order to make the classical
statistical methods applicable. Quite the contrary, the observations will be
explained  as  realizations  of  qualitative  random  variables,  which  are,  however, 
governed  by  quantitative  latent  parameters.  Change  is  defined  as  a  change  in 
these  latent  parameters. 

Models  for  Qualitative  Data 


There  are  a  variety  of  such  models,  differing  as  to  the  restrictiveness  of 
their  assumptions,  the  kind  of  results  deducible,  and  the  required  types  of 
data.  The  most  important  models  are: 

1.  The  dichotomous  linear  logistic  test  model  (LLTM),  which  was  originally 
devised  for  analyzing  the  complexity  of  intelligence  test  items  in 
terms  of  cognitive  operations  involved,  but  is  also  useful  for  measur¬ 
ing  change  in  unidimensional  latent  variables  or  for  certain  experimen¬ 
tal  designs  with  more  than  two  points  of  time.  Since  the  formalism  of 
this  model  is  rather  complicated,  it  will  not  be  dealt  with  here  (see 
Fischer,  1973,  1974a,  1974b,  1977a). 

2.  The dichotomous linear logistic model with relaxed assumptions (LLRA),
which  emphasizes  the  relaxation  of  assumptions  as  compared  with  the 
usual  latent  trait  models,  since  no  unidimensionality  of  the  criterion 
variables  or  items  is  assumed.  It  has  proven  a  very  useful  tool  for 
assessing  change  in  a  variety  of  different  situations  and  will  be  de¬ 
scribed  in  this  paper. 

3.  The  polychotomous  extension  of  the  LLTM,  for  which  applications  are 
lacking.  Since  this  paper  will  not  dwell  on  purely  theoretical  devel¬ 
opments  that  have  not  as  yet  stood  the  test  of  practical  application, 
this  model  will  merely  be  mentioned  (see  Fischer,  1974a,  1974b,  1977c). 

4.  The  polychotomous  generalization  of  the  LLRA,  offering  quite  interest¬ 
ing  possibilities  of  application  and  empirical  hypotheses  testing, 
which  will  be  mentioned  below. 

The Dichotomous LLRA


The  Model 


"Dichotomous"  means  that  the  observed  criterion  variables,  which  may  be 
test items or clinical symptoms or any other kind of behavior, are binary
variables. It is assumed that before and after treatment a number k of such
criterion variables are observed on each subject. Then, the model is defined by the
following equations:
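
Written out under the definitions that follow, and assuming the logistic form that Theorem 1 attributes to Equations 1 and 2, the model can be stated as in the sketch below. The indexing of the interaction term (pairs of treatments i < j, with i here denoting a treatment rather than a criterion) is an assumption of this sketch, not something the surrounding text fixes.

```latex
\begin{align}
P(+ \mid v, i, t_1) &= \frac{\exp(\xi_{vi})}{1 + \exp(\xi_{vi})}, \tag{1}\\
P(+ \mid v, i, t_2) &= \frac{\exp(\xi_{vi} + \delta_v)}{1 + \exp(\xi_{vi} + \delta_v)}, \tag{2}\\
\delta_v &= \sum_j q_{vj}\,\eta_j + \sum_{i<j} q_{vi}\,q_{vj}\,\rho_{ij} + \tau. \tag{3}
\end{align}
```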

Thereby, $P(+ \mid v, i, t_1)$ denotes the probability that subject v gives response "+"
in criterion i at time $t_1$ (before treatment), and $P(+ \mid v, i, t_2)$ is the
analogous probability for time $t_2$ (after treatment). The probability
$P(+ \mid v, i, t_1)$ depends solely on one parameter, $\xi_{vi}$. For example, let
criterion i be a certain symptom of fear in clinical patients; then $\xi_{vi}$ is the
latent anxiety of subject v behind that symptom. Thus, the state of subject v at
time $t_1$ is characterized by a vectorial parameter
$\xi_v = (\xi_{v1}, \dots, \xi_{vk})$, in other words, by a set of k traits
associated with the k criterion variables.



Note  that  the  model  makes  no  assumptions  whatsoever  about  interdependencies 
or  dimensionality  of  these  traits;  in  particular,  unidimensionality  of  the  cri¬ 
teria  or  items  is  not  assumed,  as  would  be  the  case  with  the  Rasch  (1960,  1966) 
or  Birnbaum  (1968)  models.  Hence,  the  LLRA  is  maximally  flexible  regarding  the 
characterization of the subjects. For example, it may well be that
$\xi_{vi} < \xi_{vj}$ but that $\xi_{wi} > \xi_{wj}$.

The characterization of the subjects at time $t_2$ is, in principle, analogous;
it is, however, restricted by the assumption that change in each subject
can be described by a single parameter $\delta_v$, which according to Equation 3 is a
linear function of the effects $\eta_j$ of the given treatments (main effects), of
their interactions $\rho_{ij}$, and of a trend parameter $\tau$ comprising all the
causes of change that are unrelated to the treatments. The constants $q_{vj}$ are
measures of the dose of treatment j as applied to subject v.
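
The way the dose constants, effects, and trend combine into the response probabilities can be illustrated with a minimal Python sketch. It assumes the logistic form of Equations 1 and 2 and the linear form described for Equation 3; the function names and all numerical values are hypothetical and chosen only for illustration.

```python
import math

def logistic(x: float) -> float:
    """Standard logistic function exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + math.exp(-x))

def change_parameter(doses, effects, trend=0.0, interactions=None):
    """delta_v as a linear function of main effects, interactions, and trend.

    doses        -- q_vj, one dose per treatment j for this subject
    effects      -- eta_j, one effect parameter per treatment j
    interactions -- optional dict {(i, j): rho_ij} for treatment pairs
    """
    delta = trend + sum(q * eta for q, eta in zip(doses, effects))
    for (i, j), rho in (interactions or {}).items():
        delta += doses[i] * doses[j] * rho
    return delta

def response_probabilities(xi, delta):
    """Return (P(+ at t1), P(+ at t2)) for one subject-criterion pair."""
    return logistic(xi), logistic(xi + delta)

# Hypothetical subject: two treatments given at doses 1.0 and 0.5.
delta_v = change_parameter(doses=[1.0, 0.5], effects=[0.4, 0.2], trend=0.3)
p_before, p_after = response_probabilities(xi=-1.0, delta=delta_v)
print(round(delta_v, 3), round(p_before, 3), round(p_after, 3))
```

Whatever value the trait parameter takes, the difference between the log-odds at the two time points equals the change parameter, which is the sense in which change is measured on the latent scale.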

The most important properties of the model are:

1.  Given appropriate data, the effect parameters $\eta_j$, the interactions
$\rho_{ij}$, and $\tau$ can be estimated independently of the true values of the
parameters $\xi_{vi}$; the latter need not be known and are not estimated from the
data, either. This means that any proposition referring to the comparison
of two treatments i and j is completely independent of the sample
of subjects (specific objectivity).

2.  The parameter estimates lie on a ratio scale, so that it is possible to
arrive at statements such as "treatment i is twice as effective as
treatment j."

3.  It  is  possible  to  test  the  significance  of  single  parameters  and  to 
test  almost  any  conceivable  meaningful  composite  hypothesis  on  the  pa¬ 
rameters  by  means  of  likelihood-ratio  tests. 

The  formal  properties  of  the  model  have  been  studied  by  Fischer  (1972,  1974a, 
1974b,  1976,  1977a,  1977c;  see  also  Fischer  &  Rop,  in  prep.). 

The  sheer  enumeration  of  the  model  properties  does  not  sufficiently  reveal 
the  full  scope  of  the  possibilities  implied  by  these  properties.  An  illustra¬ 
tive  example  will  therefore  be  in  order. 

Sample  Application 

Research  questions.  Rop  (1977)  investigated  the  effects  of  three  preschool 
educational  programs  (Early  Reading,  Logical  Thinking,  and  Verbal  Enrichment)  on 
the  cognitive  development  of  kindergarten  children.  To  assess  change,  a  battery 
of  64  items  was  given  before  and  after  the  treatment  period;  a  control  group 
attended  kindergarten  but  did  not  participate  in  the  programs.  Three  primary 
questions  were  to  be  answered: 

1.  Is  it  possible  to  furnish  proof  that  the  programs  accelerate  cognitive 
development? 

2.  What  is  the  generality  of  the  effect  of  each  of  the  programs,  e.g.,  is 
there  an  effect  of  verbal  training  also  in  the  nonverbal  area? 

3.  What  do  the  socially  and  educationally  disadvantaged  children  gain  in 
comparison  to  middle-class  children,  i.e.,  is  early  intervention  a 
means  of  overcoming  the  deficiencies  resulting  from  less  privileged 
environments? 

The  first  question,  which  is  the  easiest  one,  was  answered  in  the  affirma¬ 
tive  by  testing  the  null  hypothesis  that  the  effect  parameters  are  zero. 

The  second  question  is  far  more  complex:  Equation  3  asserts  that  the  ef¬ 
fect  of  each  treatment  can  be  measured  by  just  one  parameter  per  person,  ir¬ 
respective  of  the  item  i.  Hence,  if  it  were  true  that  verbal  enrichment  had 
little or no effect on the nonverbal abilities (which were tested by one set of
nonverbal items in the test), the model could not have been true for all 64
items. Hence, the model plays the role of an H0 against the H1 of differential
effects of the treatments in certain subgroups of items, i.e., the criterion
variables. (There is a far-reaching analogy between this model and the well-known
analysis of variance for quantitative data: In analysis of variance as
well, one begins with the global H0 that all means are equal.)

Results.  In  Rop's  study  the  H0  of  uniform  effects  of  the  treatments  on  all 
the  ability  domains  represented  by  the  items  had  to  be  rejected.  As  Table  1 
shows, the 64 items had to be broken down into three subsets (naming of objects,
actions,  and  attributes;  verbal  abilities,  such  as  verbal  fluency,  enunciation, 
and  appropriate  usage  of  language;  and  nonverbal  abilities).  Each  of  the  three 
programs  had  a  differential  effect  within  each  of  the  three  domains.  However, 
the  results  in  Table  1  show  the  findings  of  the  study  in  a  maximally  generalized 
form:  It  is  an  essential  feature  of  the  model  that  it  identifies  the  maximal 

subsets  of  criterion  variables  with  uniform  treatment  effects.  This  is  a  conse¬ 
quence  of  the  principle  of  specific  objectivity,  viz.,  that  the  estimates  of 
effect  parameters  do  not  depend  on  any  irrelevant  factors,  such  as  subjects  or 
items, as long as the model holds. In other words, only the minimum number of
moderator variables that are absolutely necessary to explain the data are
considered.


Table 1
Effects of the Training Programs per Time Unit
(1,000 minutes) and the Trend for Naming,
Verbal Intelligence, and Nonverbal Item Groups

                           Item Group
              ------------------------------------------
                          Verbal
Treatment     Naming      Intelligence      Nonverbal

Reading        .37*         .15               -.04
Thinking       .51*         .16*               .25*
Verbal         .49*         .37*               .31*
Trend          .84*         .88*               .32*

*Statistically significant at p < .01 (Adapted from Rop, 1977).


Rop's third question is the most intriguing one. If conventional methods
of data analysis had been applied, it would have been expected that
environmentally privileged children with a higher level of cognitive development, and
hence with better performance at $t_1$, would not have increased their level of
performance as much as the children with poor achievement. Such methodological
artifacts are known under the names "physicalism-subjectivism dilemma,"
"base-rate problem," or the like (see Bereiter, 1967; Lord, 1967). The LLTM, on the
contrary, asserts that the effect parameters do not depend on the subject
parameters; in other words, if the effect of treatments were really the same for
all children, the effect parameters estimated from groups of children with
different ability levels should also be equal except for random error. This is
again a direct consequence of the principle of specific objectivity. In Rop's
study it was found, in fact, that treatment effects were independent of the
initial level of cognitive development and therefore that the preschool programs
were not appropriate for bridging the gap between privileged and underprivileged
children.


It  is  obvious  that  the  properties  of  the  LLRA  model  are  quite  advantageous, 
having  encouraged  a  variety  of  applications.  However,  a  better  theoretical  and 
epistemological  foundation  of  this  methodological  approach  seems  called  for,  and 
an  answer  was  sought  to  the  following  question:  If  assessment  of  change  is  to 
be  specifically  objective,  what  is  implied  with  respect  to  the  formal  structure 
of  the  model?  A  prerequisite  for  dealing  with  this  question  is  to  formalize  the 
problem  of  measurement  of  change  in  a  sufficiently  general  way. 

The  Model 

Change is detected by exposing subjects to a set of observational
conditions such as the test items, observation of symptoms, or registration of any
other kind of criterion variables. Let the behavioral disposition or state of
the subject at time $t_1$ be described by a set of k parameters
$p_{v1,1}(\xi_{v1}), \dots, p_{vk,1}(\xi_{vk})$, so that $p_{vi,1}$ is associated
with the criterion variable i and describes fully the latent behavioral
disposition of subject v with respect to this variable. In the same way, let the
state of subject v at time $t_2$ be described by the set of parameters
$p_{v1,2}(\xi_{v1}, \delta_v), \dots, p_{vk,2}(\xi_{vk}, \delta_v)$, whereby
$\delta_v$ is a scalar parameter representing change. Nothing is assumed concerning
the functional concatenation between $\xi_{vi}$ and $\delta_v$; it is only assumed
that the reaction tendency $p_{vi,2}$ at time $t_2$ is a function of the latent
trait $\xi_{vi}$ at time $t_1$ and the change parameter $\delta_v$. Since the
objective is to assess change, the existence of a function
$U(p_{v1,1}, \dots, p_{vk,1}, p_{v1,2}, \dots, p_{vk,2})$, which can be solved
for $\delta_v$, will further be assumed. In other words, U should be a function of
$\delta_v$ alone: $U = V(\delta_v)$. It is a consequence of the principle of
specific objectivity that U must be independent of the latent trait parameters
$\xi_{vi}$ and of the sample of observational situations chosen for assessing change.

On  the  basis  of  this  formalization  of  measurement  of  change,  the  following 
theorem  can  be  proven: 

Theorem 1. Let $\partial p_{vi,1} / \partial \xi_{vi} \neq 0$ everywhere; let U be
differentiable with respect to $p_{vi,1}$ and $p_{vi,2}$; $p_{vi,1}$ with respect to
$\xi_{vi}$; and $p_{vi,2}$ with respect to $\xi_{vi}$ and $\delta_v$ for
$i = 1, \dots, k$. Further, let U be a function
$U(p_{v1,1}, \dots, p_{vk,1}, p_{v1,2}, \dots, p_{vk,2}) = V(\delta_v)$, which is
independent of $\xi_{vi}$, $i = 1, \dots, k$. Then, there exist monotone
transformations of all the parameters, so that after transformation,
$p_{vi,2} = p_{vi,1} + \delta_v$. If, in addition, the observations are assumed to be
realizations of Bernoulli variables, then, except for scale transformations,
Equations 1 and 2 must hold. (The proof of this theorem is rather complicated;
see Fischer, 1977b; Fischer & Rop, in prep.; Rasch, 1972.) The meaning of the
theorem is this: If there are dichotomous observations at two points of time
and if it is desired to assess change in a specifically objective manner (i.e.,
if the result should not depend on the sample of person-parameters $\xi_{vi}$), then
the  model  must  be  essentially  of  the  LLRA  type.  Of  course,  some  scale  transfor¬ 
mations  may  be  applied  on  the  parameter  dimensions,  entailing  formal  changes  of 
the  model,  but  any  model  and  empirical  result  obtained  in  this  manner  would  be 
completely  equivalent  to  what  is  obtained  by  means  of  the  LLRA.  Hence,  there  is 
no  point  in  transforming  the  parameters  and  thereby  departing  from  the  specifi¬ 
cally  simple  structure  of  the  LLRA. 
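
The role that conditioning plays in this result can be made explicit with a short derivation. It assumes that Equations 1 and 2 are logistic in $\xi_{vi}$ and $\xi_{vi} + \delta_v$, and, as an additional assumption of this sketch, that the two responses of a subject to a criterion are independent given the parameters. Among the response pairs that changed at all, the probability of a change from "-" to "+" is then free of $\xi_{vi}$:

```latex
\begin{align*}
P(a_{vi,2} = 1 \mid a_{vi,1} + a_{vi,2} = 1)
&= \frac{\dfrac{1}{1+e^{\xi_{vi}}}\cdot\dfrac{e^{\xi_{vi}+\delta_v}}{1+e^{\xi_{vi}+\delta_v}}}
        {\dfrac{1}{1+e^{\xi_{vi}}}\cdot\dfrac{e^{\xi_{vi}+\delta_v}}{1+e^{\xi_{vi}+\delta_v}}
         + \dfrac{e^{\xi_{vi}}}{1+e^{\xi_{vi}}}\cdot\dfrac{1}{1+e^{\xi_{vi}+\delta_v}}} \\[4pt]
&= \frac{e^{\xi_{vi}+\delta_v}}{e^{\xi_{vi}+\delta_v} + e^{\xi_{vi}}}
 = \frac{e^{\delta_v}}{1 + e^{\delta_v}} .
\end{align*}
```

Pairs with $a_{vi,1} + a_{vi,2} = 0$ or $2$ carry no information about $\delta_v$, so only the changed pairs enter a likelihood built on this conditional probability.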


Estimating  Model  Parameters 

Since  this  theorem  legitimates  the  LLRA  as  theoretically  well  founded,  a 
short  discourse  on  the  technical  problems  of  parameter  estimation  and  hypotheses 
testing  is  called  for.  To  simplify  matters,  Equation  3  can  be  rewritten  as 


$$\delta_v = \sum_j q_{vj}\,\eta_j . \qquad [4]$$

This can be done because Equation 3 is linear in all the parameters; hence,
parameters $\eta_j$ and matrix $Q = ((q_{vj}))$ need to be redefined appropriately.


Theorem 2. Let $A_1 = ((a_{vi,1}))$ be the item-score matrix (with elements 1 if
the response was "+" and 0 if "-") for time $t_1$ and $A_2 = ((a_{vi,2}))$ for time
$t_2$, $v = 1, \dots, n$ and $i = 1, \dots, k$. The conditional maximum likelihood
estimates of the effect parameters are given by the equations

$$\sum_v \sum_i q_{vj}\left[\, a_{vi,2} - \frac{\sum_{t=u}^{w} t\,\exp\bigl(t \sum_l q_{vl}\,\hat{\eta}_l\bigr)}{\sum_{t=u}^{w} \exp\bigl(t \sum_l q_{vl}\,\hat{\eta}_l\bigr)} \right] = 0 , \qquad [5]$$

with $i = 1, \dots, k$; $j = 1, \dots, m$; and

$u(v,i) = w(v,i) = 0$ if $a_{vi,1} + a_{vi,2} = 0$;

$u(v,i) = w(v,i) = 1$ if $a_{vi,1} + a_{vi,2} = 2$; and

$u(v,i) = 0$, $w(v,i) = 1$ if $a_{vi,1} + a_{vi,2} = 1$.


The estimation Equations 5 have a finite solution $\hat{\eta}$ if, for
$j = 1, \dots, m$,

$$\sum_v \sum_i q_{vj}\, a_{vi,1}\,(1 - a_{vi,2}) > 0 \qquad [6]$$

holds; the solution is unique if the rank of $Q = ((q_{vj}))$ equals m, and it is
at a maximum of the likelihood function.

The  estimation  Equations  5,  and  the  corresponding  second-order  partial  de¬ 
rivatives  that  are  needed  for  applying  the  Newton-Raphson  procedure,  were  given 
by  Fischer  (1972,  1974a,  1974b,  1977a,  1977c;  for  the  complete  proof  of  Theorem 
2,  see  Fischer  and  Rop,  in  prep.) 

This theorem is not only useful for determining the existence and uniqueness
of the solution but it also implies that the parameter estimates lie on a
ratio scale with a unit determined by the time interval $(t_1, t_2)$ together with
the chosen unit of measurement of dosage.
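
The Newton-Raphson procedure mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the program used in the studies cited: it assumes the model without interaction terms (Equation 4), uses the fact that only response pairs with $a_{vi,1} + a_{vi,2} = 1$ contribute to Equations 5, and runs on hypothetical toy data. The function names `solve` and `cml_estimate` are invented for this sketch.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x

def cml_estimate(doses, before, after, iterations=25):
    """Newton-Raphson on the conditional likelihood of the effect parameters.

    doses[v][j]  -- dose q_vj of treatment j given to subject v
    before[v][i] -- response a_vi,1 (0 or 1) at t1
    after[v][i]  -- response a_vi,2 (0 or 1) at t2
    """
    m = len(doses[0])
    eta = [0.0] * m
    # Per subject: number of changed pairs and number of changes from 0 to 1.
    changed = [sum(b != a for b, a in zip(bv, av))
               for bv, av in zip(before, after)]
    gains = [sum(b == 0 and a == 1 for b, a in zip(bv, av))
             for bv, av in zip(before, after)]
    for _ in range(iterations):
        grad = [0.0] * m
        info = [[0.0] * m for _ in range(m)]  # negative Hessian
        for q, n_gain, n in zip(doses, gains, changed):
            delta = sum(qj * ej for qj, ej in zip(q, eta))
            p = 1.0 / (1.0 + math.exp(-delta))
            for j in range(m):
                grad[j] += q[j] * (n_gain - n * p)
                for l in range(m):
                    info[j][l] += n * p * (1.0 - p) * q[j] * q[l]
        step = solve(info, grad)
        eta = [e + s for e, s in zip(eta, step)]
    return eta

# Hypothetical toy data: subject 0 received only treatment 0, subject 1 only
# treatment 1, each at unit dose, with four criteria observed twice.
doses = [[1.0, 0.0], [0.0, 1.0]]
before = [[0, 0, 0, 1], [0, 0, 1, 1]]
after = [[1, 1, 1, 0], [1, 1, 0, 0]]
eta_hat = cml_estimate(doses, before, after)
print([round(e, 4) for e in eta_hat])
```

In this toy design the two treatments never occur together, so each estimate reduces to the log of the ratio of gains to losses for that treatment; the general routine is needed once subjects receive several treatments at different doses.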

Hypothesis  Testing 


In  practical  applications,  the  estimation  of  the  parameters  is  only  a  first 
step.  More  important  is  the  test  of  hypotheses  on  the  parameters.  Such  tests 
can  be  carried  out  by  means  of  the  likelihood  ratio  principle. 

Let $\hat{\eta} = (\hat{\eta}_1, \dots, \hat{\eta}_m)$ be the estimates of effect
parameters under hypothesis $H_1$ (alternative hypothesis) and let $L(H_1)$ be the
maximized conditional likelihood of the data under $H_1$,

$$L(H_1) = \prod_v \prod_i L_{vi} , \quad \text{with} \qquad [7]$$

$$L_{vi} = \begin{cases}
\dfrac{\exp\bigl(a_{vi,2} \sum_j q_{vj}\,\hat{\eta}_j\bigr)}{1 + \exp\bigl(\sum_j q_{vj}\,\hat{\eta}_j\bigr)} & \text{if } a_{vi,1} + a_{vi,2} = 1 , \\[8pt]
1 & \text{if } a_{vi,1} + a_{vi,2} = 0 \text{ or } = 2 .
\end{cases} \qquad [8]$$

Further, let $H_0$ be a null hypothesis consisting of the restrictions
$\eta_j = \mu_j(\beta_1, \dots, \beta_{m'})$ with $m' < m$, whereby the matrix of
partial derivatives $\partial \mu_j / \partial \beta_l$ has rank $m'$. Finally, let
$L(H_0)$ be the likelihood of the data under $H_0$, whereby the maximum likelihood
estimates $\eta_j^* = \mu_j(\hat{\beta}_1, \dots, \hat{\beta}_{m'})$ are inserted in
Equation 8 instead of $\hat{\eta}_j$. Then, under $H_0$,

$$-2 \ln \lambda = -2 \ln \{ L(H_0) / L(H_1) \}$$

is asymptotically chi-square-distributed with $df = m - m'$. It can easily be
shown that most hypotheses relevant in practical applications can be formulated
as restrictions $\eta_j = \mu_j(\beta_1, \dots, \beta_{m'})$ and hence can be tested
by means of this likelihood ratio test. As long as the restrictions are linear
contrasts, estimation Equations 5 can be used for estimating the parameters under
$H_0$ as well; otherwise, the estimation equations require a minor adaptation,
which need not be discussed here.
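
A likelihood ratio test of this kind can be sketched as follows. The example is hypothetical throughout: two groups of subjects each receive one treatment at unit dose, the counts of changed response pairs are invented, and the hypothesis tested is that both treatments are equally effective. In this simple design the conditional estimates have a closed form (the log of gains over losses), which the sketch uses instead of an iterative routine. The helper `chi2_sf` is written out for integer degrees of freedom so that the block needs only the standard library.

```python
import math

def cond_loglik(eta, groups):
    """Log of the conditional likelihood, computed from change counts.

    groups -- list of (doses, n_gain, n_loss); n_gain counts pairs that went
              from "-" to "+", n_loss pairs that went from "+" to "-".
    """
    total = 0.0
    for doses, n_gain, n_loss in groups:
        delta = sum(q * e for q, e in zip(doses, eta))
        total += n_gain * delta - (n_gain + n_loss) * math.log1p(math.exp(delta))
    return total

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    y = max(x, 0.0) / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(-y) * y ** a / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical counts: treatment 0 shows 30 gains and 10 losses,
# treatment 1 shows 20 gains and 20 losses.
groups = [([1.0, 0.0], 30, 10), ([0.0, 1.0], 20, 20)]

# H1: separate effects; closed-form estimates for this design.
eta_h1 = [math.log(30 / 10), math.log(20 / 20)]
# H0: one common effect (m' = 1), estimated from the pooled counts.
common = math.log((30 + 20) / (10 + 20))
eta_h0 = [common, common]

statistic = -2.0 * (cond_loglik(eta_h0, groups) - cond_loglik(eta_h1, groups))
p_value = chi2_sf(statistic, df=1)  # df = m - m' = 2 - 1
print(round(statistic, 3), round(p_value, 4))
```

The same pattern applies to the other hypotheses listed in this section: estimate under the restricted and the unrestricted model, take twice the difference of the log-likelihoods, and refer it to a chi-square distribution with $m - m'$ degrees of freedom.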


It  is  a  basic  feature  of  the  model  that,  in  a  formal  respect,  it  makes  no 
difference  whether  a  new  set  of  subjects  or  additional  criteria  are  added  to  the 
given  observations.  Therefore,  differences  of  treatment  effects  between  subsets 
of  subjects  and  between  subsets  of  criteria  lend  themselves  to  exactly  the  same 
kind  of  test.  Some  examples  of  hypotheses  typically  tested  in  the  applications 
are  the  following: 


1.  All interactions are zero ($\rho_{ij} = 0$; $i, j = 1, \dots, m$).

2.  Some treatments are equally effective (e.g., $\eta_i = \eta_j$).

3.  Some treatments are ineffective (e.g., $\eta_j = 0$).

4.  The trend effect is zero ($\tau = 0$).

5.  The effect of treatments and/or the trend effect is equal for different
subgroups of subjects or in different subgroups of criteria
($\eta_j^{(I)} = \eta_j^{(II)}$ for Groups I and II).


In  principle,  the  tests  are  logically  analogous  to  hypothesis  testing  in 
linear  analysis  of  variance  and  to  testing  linear  contrasts  between  groups  of 
mean  values. 


An  interesting  special  case  arises  when  testing  the  dose-response  relation¬ 
ship:  Leaving  aside  the  question  of  interactions,  the  model  Equation  3  presup¬ 
poses  that  the  effect  of  treatment  is  proportional  to  the  dose.  However,  gener¬ 
al  experience  indicates  that  in  some  cases  a  treatment  is  completely  ineffective 
below a certain minimal dose and that above a certain amount of treatment
satiation occurs. It is therefore important that the hypothesis of linearity,
Equation 3, be tested against an unspecified nonlinear dose-response curve. This
can be done by means of the following parameterization: Suppose that dose is not
a continuous variable but assumes certain discrete values $u_1, \dots, u_s$. Then,
it is possible to assign one parameter $\eta_{j1}, \dots, \eta_{js}$ to each of
these doses. To embody this set of new parameters into the model, let
$b_{vj} = (b_{vj,1}, \dots, b_{vj,s})$ be a selection vector with elements

$$b_{vj,t} = \begin{cases}
1 & \text{if subject } v \text{ has obtained dose } u_t \text{ in treatment } j, \text{ and} \\
0 & \text{otherwise.}
\end{cases} \qquad [9]$$

The selection vector for each combination of subject v x treatment j consists of
0's except for one element, which is equal to 1 and indicates the dose obtained
in the respective treatment. The model Equation 4 then becomes

$$\delta_v = \sum_j \sum_{t=1}^{s} b_{vj,t}\,\eta_{jt} . \qquad [10]$$

Now, consider Equation 4 as $H_0$ and Equation 10 as $H_1$, allowing a likelihood
ratio test of the linearity hypothesis.
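
The selection-vector parameterization can be illustrated with a short Python sketch. It assumes that the change parameter under the nonlinear model is the sum, over treatments, of the one dose-specific parameter picked out by each selection vector; the dose levels and parameter values are hypothetical, and the function names are invented for this sketch.

```python
def selection_vector(dose, dose_levels):
    """Return b_vj: 1 at the position of the dose received, 0 elsewhere."""
    if dose not in dose_levels:
        raise ValueError(f"dose {dose!r} is not one of the discrete levels")
    return [1 if dose == level else 0 for level in dose_levels]

def change_nonlinear(subject_doses, dose_levels, eta):
    """delta_v = sum_j sum_t b_vj,t * eta[j][t] (one parameter per dose level)."""
    delta = 0.0
    for j, dose in enumerate(subject_doses):
        b = selection_vector(dose, dose_levels)
        delta += sum(bt * e for bt, e in zip(b, eta[j]))
    return delta

def change_linear(subject_doses, eta_linear):
    """delta_v = sum_j q_vj * eta_j (effect proportional to dose)."""
    return sum(q * e for q, e in zip(subject_doses, eta_linear))

# Hypothetical design: one treatment, three dose levels (sessions per week).
dose_levels = [0.5, 1.0, 2.0]
eta_by_dose = [[0.2, 0.9, 1.0]]   # hypothetical: effect levels off at 2.0
eta_linear = [0.6]                # hypothetical single proportional effect

for dose in dose_levels:
    print(dose,
          selection_vector(dose, dose_levels),
          change_nonlinear([dose], dose_levels, eta_by_dose),
          change_linear([dose], eta_linear))
```

The linear model is the special case in which each dose-specific parameter equals the dose times a single effect parameter, which is why it can serve as the null hypothesis nested within the dose-specific model.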

Applications of the LLRA in Measuring Change

Rop's study on the effects of preschool education has already been
mentioned above: The programs (e.g., logical training) did not just affect the
narrow domain of the functions trained but also influenced the other intellectual
factors, if less markedly. This problem of transfer of cognitive operations has
been  discussed  by  Zeman  (1976),  who  investigated  the  effects  of  early  training 
in  elementary  set  theory;  she  proved  a  substantial  transfer  of  the  operations 
acquired  from  material  used  in  the  learning  phase  to  other  materials.  This 
finding  implies  that,  as  was  hoped,  this  specific  preschool  education  is  in  fact 
a  rather  general  vehicle  for  promoting  cognitive  development. 

An  interesting  application  of  the  LLRA  to  clinical  psychology  stems  from 
Heckl  (1976),  who  investigated  the  effects  of  three  forms  of  speech  therapy  in 
children  with  speech  disorders.  Contrary  to  expectation,  all  three  therapies 
proved  to  be  equally  effective.  The  interpretation  was  that  the  effect  appar¬ 
ently  was  brought  about  by  the  intensive  devotion  of  the  therapist  to  the  handi¬ 
capped  children  and  by  the  reinforcement  given  to  their  verbal  productions — rel¬ 
atively  independent  of  the  content  of  the  prescribed  exercises.  Heckl's  study 
is  one  of  the  few  where  the  linearity  of  the  dose-response  curve  was  empirically 
tested:  A  substantial  difference  in  effect  between  children  with  fortnightly 

therapeutic  sessions  and  children  with  one  session  per  week  was  observed;  the 
further  benefit  of  two  sessions  per  week,  however,  was  comparatively  small. 
Apparently,  satiation  occurred  in  the  latter  case.  As  in  Rop's  study,  the  ef¬ 
fect  parameters  were  (at  least  approximately)  constant  over  different  groups  of 
children,  i.e.,  independent  of  age,  sex,  and  only  partially  dependent  on  the 
degree  of  initial  speech  impediment. 

Another  study  in  the  domain  of  clinical  psychology  is  that  of  Glatz  (1977), 
who  investigated  the  effects  of  behavior  therapy  on  the  eating  performance  of 
mentally  retarded  children.  A  special  feature  of  this  study  is  that  observa¬ 
tions  were  made  at  eight  points  in  time,  yielding  a  behavioral  sequence  for  each 
child.  Glatz  used  the  LLRA  for  comparing  two  successive  points  of  time  each; 
strictly  speaking,  however,  another  type  of  linear  logistic  model,  the  LLTM, 
would have been more appropriate. A reanalysis of the data is underway (Fischer
& Rop, in prep.).

There have been several additional applications of the model. Vodopiutz
(1977) studied the effects of certain training units on complex movements in
gymnastic education; Pendl (1976), the effects of a language laboratory on
teaching a foreign language (English) in high school; Rella (1976), the results
of driver improvement training in anticipating dangerous traffic situations;
Platzer (1978), the effects of technical playing materials on the development of
mechanical-technical understanding; Witek (1979), the effects of a group-dynamic
sensitivity training for business executives; and Zimprich (1979), the effects
of psychotherapy, given in addition to chemotherapy, to patients of an internal
department of a children's hospital.

The Polychotomous LLRA


The  Model 


Although  quite  often  the  data  are  readily  reducible  to  dichotomous  vari¬ 
ables,  in  many  cases  such  a  reduction  either  is  not  possible  or  makes  little 
sense.  In  spite  of  this,  designs  with  polychotomous  data  have  not  received 
enough  attention  in  the  literature  owing  to  the  lack  of  suitable  methodology. 
As early as the first papers on linear logistic models for measuring change
(Fischer, 1972, 1974a, 1974b, 1977c), the possibility of generalizing the LLRA
to polychotomous data was recognized and the necessary estimation equations
were derived. Without going into technical details, the essentials of the
parameterization will be presented here.


Suppose  that  k  polychotomous  variables,  each  of  which  may  assume  one  of  r 
qualitative  or  quantitative  realizations,  are  the  basis  for  assessment  of 
change.  A  generalization  of  the  model  Equations  1  to  3  is  then 
Thereby, $A_{vi} = (A_{vi}^{(1)}, \ldots, A_{vi}^{(r)})$ is an indicator vector-variable with realizations
$a_{vi}^{(h)} = 1$ if subject $S_v$'s reaction on criterion $i$ was in category $h$, and $a_{vi}^{(h)} = 0$
otherwise. The state of each subject at $t_1$ is characterized by a matrix of
parameters, and "change" is described by a vectorial parameter
$\delta_v = (\delta_v^{(1)}, \ldots, \delta_v^{(r)})$; element $\delta_v^{(h)}$ measures the effect of the treatments with
respect to reaction category $h$. Analogously, the effect of each treatment is described by a
vectorial parameter $\eta_j = (\eta_j^{(1)}, \ldots, \eta_j^{(r)})$, its elements being associated with the
specific effect of treatment $j$ with respect to response category $h$. To be more
concrete,  the  behavior  categories  of  a  depressive  patient  could  be,  for  example, 
agitated, withdrawn, and normal; a certain psychiatric treatment could then have
a very strong effect of reducing agitation and increasing withdrawal without,
however,  necessarily  increasing  the  rate  of  normal  behavior. 


Several  such  qualitative  categories  may  as  well  express  different  levels  of 
an  underlying  latent  dimension,  i.e.,  different  degrees  of  one  behavioral  ten¬ 
dency.  A  typical  example  would  be  the  categories  very  content,  rather  content, 
rather  not  content,  not  at  all  content,  reflecting  degrees  of  satisfaction 
(e.g., with a job). The case of unidimensionality of the response categories
with  respect  to  the  treatments  is  then  formalized  as  follows: 
Equation 14 has been called the reduction conditions; of course, it is a purely
empirical matter whether they hold or not. If they hold, the matrix of parameters
is of rank 1. The parameters $\phi^{(h)}$ are called the category weights.
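
To illustrate what a rank-1 parameter matrix means, the following minimal Python sketch assumes that each treatment effect factors into a scalar effect times a category weight. This multiplicative form, the variable names, and all numerical values are assumptions made for illustration only; they are not taken from the text.

```python
# Hypothetical values, chosen only for illustration.
treatment_effects = [0.8, -0.3]          # one scalar effect per treatment j (assumed)
category_weights = [0.0, 0.5, 1.0, 1.5]  # one weight per response category h (assumed)

# Assumed multiplicative form: eta[j][h] = treatment_effects[j] * category_weights[h].
eta = [[e * w for w in category_weights] for e in treatment_effects]


def has_rank_at_most_one(matrix, tol=1e-12):
    """Return True if every 2x2 minor of the matrix vanishes (rank <= 1)."""
    rows, cols = len(matrix), len(matrix[0])
    for a in range(rows):
        for b in range(a + 1, rows):
            for c in range(cols):
                for d in range(c + 1, cols):
                    minor = matrix[a][c] * matrix[b][d] - matrix[a][d] * matrix[b][c]
                    if abs(minor) > tol:
                        return False
    return True


print(has_rank_at_most_one(eta))
```

Under this assumed form, a treatment is summarized by a single number, and the category weights carry all the information about how the response categories are ordered.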

As in the dichotomous LLRA, the effect parameters and the trend effects can
be estimated empirically, independently of the person parameters, which
characterize the state of the sample at $t_1$. Furthermore, hypotheses are testable by
means  of  likelihood  ratio  tests.  One  reservation,  however,  must  be  made  regard¬ 
ing  the  reduction  conditions:  When  the  parameters  are  estimated  under  assump¬ 
tion  of  Equation  14,  the  solutions  of  the  estimation  equations  are  not  necessar¬ 
ily  unique. 

Applications  of  the  Polychotomous  LLRA 

The  numerical  computations  for  estimating  parameters  in  the  polychotomous 
case  are  much  more  complex  than  in  the  dichotomous  case,  and  some  theoretical 
questions  need  further  investigation  (as,  for  example,  uniqueness  of  the  solu¬ 
tion  in  case  of  the  reduction  conditions).  In  addition,  the  amount  of  data  re¬ 
quired  is  much  larger  than  in  the  dichotomous  LLRA.  For  these  reasons,  only  a 
few  empirical  applications  have  been  realized  so  far.  Nevertheless,  the  poly¬ 
chotomous  LLRA  is  a  potentially  powerful  instrument  for  assessing  change,  as 
will  be  illustrated  by  the  following  two  empirical  studies. 

Hammer  (1978)  investigated  the  cognitive  and  attitudinal  effects  of  a  mul¬ 
ti-media  presentation  dealing  with  forms  of  human  settlement,  problems  of  big 
cities,  and  ecology.  The  presentation  was  viewed  by  one  sample  of  high-school 
children,  whereas  another  sample  received  instruction  on  the  same  topics  from  a 
teacher.  The  cognitive  effects  of  both  methods  of  instruction  were  measured  by 
a questionnaire with the three response categories correct, partially correct,
and incorrect; the attitudinal/emotional effects were evaluated by another
questionnaire with categories positive, neutral, negative, and don't know. As
expected, the multi-media presentation proved to be generally more effective than
the  teacher,  especially  so  with  respect  to  the  domain  of  attitudinal  and  emo¬ 
tional  change;  the  teacher  was  able  to  impart  knowledge  rather  than  to  influence 
attitudes  or  to  appeal  to  emotions. 
The second example of an application of the polychotomous LLRA returns to
the problem of measuring effects of mass communication mentioned earlier:
Kropiunigg (1979) carried out a field study on a topical problem of social and
political interest in Austria, the reform of penal law in 1975. In Styria,
one  of  the  nine  provinces  of  Austria,  an  informational  campaign  on  this  topic 
was  promoted  by  the  Regionalprogramm  Studio  Steiermark  (radio)  and  by  the  Kleine 
Zeitung  Graz  (newspaper),  whereby  problems  of  probation  and  resocialization  of 
convicts  were  dealt  with. 

Before  and  after  the  campaign,  representative  samples  of  the  population 
were interviewed ($t_1$: n = 550; $t_2$: n = 640). The questionnaire comprised items
referring to three attitudinal domains and one set of items for assessing
familiarity with relevant facts. Since the subjects interviewed at $t_1$ and $t_2$ (unlike
the  case  of  the  standard  LLRA)  were  not  the  same,  a  modified  version  of  the  mod¬ 
el  for  independent  samples  had  to  be  used  (see  Fischer,  1972,  1974a,  1974b, 
1977c). 

This study differed from the other above-mentioned investigations
in  one  essential  respect:  It  was  not  possible  to  obtain  generalizable  proposi¬ 
tions  with  respect  to  the  effects  of  the  media.  The  results  rather  supported 
the  standard  conjecture  of  communication  theory:  that  effects  of  communications 
are  strongly  determined  by  a  number  of  moderator  variables  (e.g.,  socioeconomic 
factors).  Only  the  result  that  the  radio  programs  were  more  effective  than  the 
respective  articles  of  the  daily  newspaper  was  of  some  generality.  Those  seg¬ 
ments  of  the  population  characterized  by  high  contact  frequency  with  the  radio 
programs in question showed satiation regarding the information on the issue; an
increase  of  density  of  the  pertinent  information  in  the  newspaper,  on  the  other 
hand,  would  still  have  increased  the  effects  of  the  campaign.  A  somewhat  unex¬ 
pected  finding  was  the  relatively  limited  acceptance  of  the  promoted  ideas  by 
women  and  by  religious  people,  whereas  supporters  of  the  (governing)  socialist 
party  showed  significantly  above-average  understanding. 

The principal goal of giving a simple characterization of each medium by a
few effect parameters, which had originally led to the development of the
LLRA and other linear logistic models, was not reached in this empirical study.
Perhaps  the  epistemological  basis  of  these  considerations  is  not  appropriate  for 
the  complex  problem  of  social  science.  But  the  theoretical  developments  and  the 
applications  in  other  fields,  as  mentioned  above,  indicate  that  it  was  worth¬ 
while  to  venture  models  that  derive  very  simplified  and  generalized  results  from 
complex  bodies  of  qualitative  data. 
The Mental Growth Curve Re-examined


R.  Darrell  Bock 
University  of  Chicago 


A study purporting to show the growth of mental ability, as measured by the
Binet test, as a function of chronological age was published in 1929 by
Thurstone and Ackerson. The curve was based on a rescaling of Binet mental ages
(MA) of a cross-sectional sample of 4,208 children from ages 3 through 17, seen
at the Institute for Juvenile Research in Chicago. The shape of that curve,
which is reproduced in Figure 1, is surprising in one respect: It shows an
inflection point at about 10 years of age, where an initial positive acceleration
switches to negative. There is no precedent for this type of growth curve in
any other aspect of human growth. All other such curves, in particular those
for growth in stature (see Bock & Thissen, 1980), show a rapid deceleration from
birth through adolescence, followed by a brief period of acceleration during the
adolescent growth spurt. (In longitudinal growth records of individual
children, a slight middle-childhood spurt can sometimes also be seen between 6 and 7
years, but this is not evident in cross-sectional data.)

Figure 1
Thurstone's Curve for Binet Mental Growth
(from Thurstone & Ackerson, 1929)
[Horizontal axis: Chronological Age]

Any  discussion  of  the  shape  of  such  curves  requires  that  the  unit  of  scale 
be  equal  at  all  points  throughout  the  range  of  measurement.  Because  there  is  no 
reason  to  suppose  that  MA  scores  for  the  Binet  have  this  property,  some  method 
of  scaling  the  test  responses  that  will  yield  a  uniform  unit  must  be  adopted. 
Thurstone (1925, 1927, 1928) formulated such a method. It rests on two very
general assumptions: (1) that the distributions of mental age (or attainment)
conditional upon chronological age have the same (continuous) functional form at
all age levels but may differ in mean and dispersion (standard deviation); (2)
that the origin of measurement can be assigned so that the dispersion of the
conditional distributions is directly proportional to the mean, that is, so that
the coefficient of variation is constant.

Thurstone  pointed  out  that  if  the  functional  form  of  the  common  distribu¬ 
tion  is  known,  these  assumptions  may  be  checked  (1)  by  converting  the  observed 
proportions  of  people  at  each  age  level  who  respond  correctly  to  each  test  item 
to the corresponding percentage point of the distribution and (2) by plotting the
resulting  transformed  proportions  as  a  function  of  age.  If  the  points  tend  to 
lie  on  straight  lines  and  the  slopes  of  the  lines  decrease  with  increasing  age, 
the assumptions are justified. Thurstone (1925, 1927, 1928) exhibited numerous
examples  of  data  in  which  these  assumptions  seem  reasonable  when  the  conditional 
distributions  are  assumed  normal.  He  also  developed  simple  numerical  methods 
for  estimating  the  item  means  (thresholds)  and  the  constants  of  proportionality 
for  the  item  standard  deviations.  He  called  this  procedure  the  "method  of  abso¬ 
lute  scaling."  Although  the  method  is  no  longer  used,  it  is  important  as  a  fore¬ 
runner  of  modern  item  characteristic  curve  (ICC)  scaling  procedures. 
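
The two checking steps described above can be sketched in Python as follows. The proportions are hypothetical, the normal distribution is taken as the assumed common form (as in Thurstone's examples), and the variable names are illustrative.

```python
from statistics import NormalDist

# Hypothetical proportions correct for one item in five yearly age groups.
ages = [6, 7, 8, 9, 10]
proportion_correct = [0.12, 0.31, 0.55, 0.76, 0.90]

std_normal = NormalDist()  # mean 0, standard deviation 1

# Step (1): convert each observed proportion to the corresponding normal deviate.
deviates = [std_normal.inv_cdf(p) for p in proportion_correct]

# Step (2): the deviates would be plotted against age; here they are only listed.
for age, z in zip(ages, deviates):
    print(f"age {age}: z = {z:+.3f}")
```

If the listed deviates fall close to a straight line when plotted against age, the normality assumption is not contradicted for that item.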

However,  this  method  was  not  used  directly  on  the  item  data  by  Thurstone 
and  Ackerson  (1929);  rather,  they  obtained  the  means  and  constants  of  propor¬ 
tionality  indirectly  from  the  MA  distributions  of  yearly  age  groups.  (In  sup¬ 
plementary  tables,  the  actual  data  distributions  are  given  in  3-month  intervals 
for  boys  and  girls  separately,  with  boys  substantially  outnumbering  girls  in  the 
sample.)  This  labor-saving  compromise  of  the  absolute  scaling  method  can  be  jus¬ 
tified  on  grounds  that  the  mean  and  dispersion  obtained  from  the  average  percent 
correct  for  items  represented  in  the  MA  score  will  be  a  good  approximation  to 
the  average  of  the  means  and  dispersions  of  the  separate  items.  There  is  no 
reason to believe that the unusual characteristics of the Thurstone-Ackerson
mental growth curve are due to their scaling the Binet data at the
score  level  rather  than  at  the  item  level. 

A  more  plausible  explanation  is  that  the  shape  of  the  curve  is  influenced 
by Thurstone's use of the observed ratios of MA dispersions in successive
chronological age groups to determine the factor of proportionality (coefficient of
variation) relative to the mean scaled mental age. The growth curve thus
obtained, although independent of the arbitrary Binet MA scale in the conditional
means,  is  not  independent  of  the  scale  in  the  calculation  of  the  conditional 
dispersions.  A  solution  independent  of  arbitrary  scale  artifacts  in  both  item 
thresholds  and  dispersions  was  not  practical  with  the  hand  methods  of  computa¬ 
tion  then  available  to  Thurstone. 
A Scaling Procedure Fully Independent of Chronological Age Units

With  the  aid  of  modern  computers,  however,  the  Binet  test,  or  similar  tests 
referenced to chronological age or to another external criterion, can be scaled on
the  single  assumption  that  the  underlying  distributions  for  item  attainments 
conditional  on  age  have  a  known  common  functional  form  indexed  by  a  threshold 
and  a  dispersion  parameter.  If  this  assumption  is  satisfied,  scale  values  may 
be assigned to the chronological age groups so that, with respect to the growth
continuum, all the ICCs simultaneously fit the observed percent-correct data for
each  item  in  each  age  group.  On  the  further  assumption  that  item  responses 
within  the  age  groups  are  independent  (locally  independent),  the  goodness  of  fit 
of  the  solution  can  be  tested  by  a  large-sample  statistical  test. 
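
Under this assumption, each ICC and the fit of a candidate set of scale values can be written down directly. The sketch below is a minimal illustration for a normal ogive ICC and is not the Bock (1976) procedure itself; the function names, the binomial form of the log-likelihood, and all data values are assumptions.

```python
from math import log
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal distribution function


def icc(scale_value, threshold, dispersion):
    """Normal ogive probability of a correct response at a given scale value."""
    return _PHI((scale_value - threshold) / dispersion)


def log_likelihood(scale_values, items, n_tested, n_correct):
    """Binomial log-likelihood (up to a constant), assuming local independence.

    scale_values[g] : scale value assigned to age group g
    items[j]        : (threshold, dispersion) of item j
    n_tested[g]     : number of children tested in group g
    n_correct[g][j] : number in group g answering item j correctly
    """
    total = 0.0
    for g, x in enumerate(scale_values):
        for j, (threshold, dispersion) in enumerate(items):
            p = icc(x, threshold, dispersion)
            k = n_correct[g][j]
            total += k * log(p) + (n_tested[g] - k) * log(1.0 - p)
    return total


# Hypothetical data: three age groups, two items.
scale_values = [40.0, 60.0, 80.0]
items = [(50.0, 20.0), (70.0, 20.0)]
n_tested = [20, 20, 20]
n_correct = [[6, 2], [14, 6], [19, 14]]

print(log_likelihood(scale_values, items, n_tested, n_correct))
```

A scaling solution would search over the scale values and item parameters for the maximum of such a function; that search is not shown here.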

A  Biological  Example 


A maximum likelihood procedure for scaling by this method, when a normal
ogive ICC is assumed, is presented in the appendix to Bock (1976). This procedure
has been applied by Kolakowski and Bock (in press) to biological data consisting
of counts of emerged permanent dentition in a large cross-sectional sample
of Pima Indian children (Dahlberg & Menegaz-Bock, 1958). Reproduced in Figure 2
are the scale values obtained by Kolakowski and Bock (in press), plotted
as a function of chronological age. As can be seen, the curves based on incidence
of emerged permanent teeth initially decelerate and show some suggestion
of an adolescent growth spurt in both sexes. There is no evidence of the initial
positive acceleration that was found in the mental growth curve by Thurstone
and Ackerson (1929).

Figure 2
Developmental Age Curves Inferred from Emergence
of Permanent Dentition in Pima Indian Children
(from Kolakowski & Bock, 1980)
[Horizontal axis: Chronological Age (Years)]

An  unavoidable  limitation  of  all  such  scaling  methods  is  that  the  origin 
and unit of measurement of the scale are arbitrary in each sample analyzed. In
the  case  of  the  tooth  emergence  data,  Kolakowski  and  Bock  (in  press)  adjusted 
the  origin  and  unit  so  that  the  threshold  and  dispersion  of  one  of  the  teeth 
that  is  known  to  show  no  sex  difference  in  emergence  time,  an  upper  central  in¬ 
cisor,  had  the  same  values  as  those  in  the  literature  based  on  probit  analyses 
of  tooth  frequencies  as  a  function  of  chronological  age  (Dahlberg  &  Menegaz- 
Bock,  1958).  The  curves  for  the  two  sexes  in  Figure  2  are  based  on  this  choice 
of  origin  and  unit.  Thurstone  and  Ackerson  (1929)  based  the  origin  of  their 
scale at an inferred point of zero variability (Thurstone's, 1928, "absolute
zero"  of  intelligence)  and  set  the  unit  so  that  the  MA  of  the  year  group  equaled 
chronological  age  (CA). 


Scaling  the  Binet  Test 


Data  and  Method 


Using data supplied by Reckase (1979), the Bock (1976) procedure was applied
to 96 items of the current version of the Stanford-Binet. These data,
which  are  reproduced  in  Appendix  Tables  A  and  B,  are  drawn  from  the  full  comple¬ 
ment  of  122  Binet  tasks,  with  the  first  13  omitted  because  all  subjects  respon¬ 
ded  correctly  and  the  last  13  omitted  because  all  subjects  responded  incorrect¬ 
ly. 


The  numbers  and  mean  age  of  boys  and  girls  in  each  CA  group  are  shown  in 
Table  1.  In  some  instances,  alternative  forms  of  an  item  were  treated  in  the 
scaling  as  if  they  were  the  same  item.  The  data  are  strictly  cross-sectional 
and,  like  all  such  data,  are  not  constrained  to  be  increasing  with  chronological 
age  (see  Bock,  1979). 

The  scaling  solutions  based  on  Bock  (1976)  converged  in  13  Newton-Raphson 
iterations.  The  computations  required  63  seconds  of  IBM  370/168  cpu  time  and 
465K  bytes  of  core  storage.  Scale  values  for  successive  10-month  chronological 
age  groups  were  calculated.  The  origin  and  unit  of  the  scale  for  boys  were 
fixed  so  that  the  values  for  the  40-month  and  160-month  groups  were  40  and  160, 
respectively.  The  unit  of  the  girls'  scale  was  then  set  so  that  the  averages  of 
the  item  dispersions  for  boys  and  girls  were  equal  (to  19.55);  and  the  origin  of 
the  girls'  scale  was  set  so  that  the  scale  value  of  the  160-month  group  was  160. 
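
Fixing the origin and unit in this way amounts to a linear transformation of the estimated scale values. The following sketch shows the two-anchor case used for the boys' scale; the raw values and the function name are hypothetical, and the equal-dispersion step used for the girls' scale is not shown.

```python
def fix_origin_and_unit(values, low_index, high_index, low_target, high_target):
    """Linearly rescale values so that two chosen groups take the target values."""
    unit = (high_target - low_target) / (values[high_index] - values[low_index])
    origin = low_target - unit * values[low_index]
    return [origin + unit * v for v in values]


# Hypothetical scale values on an arbitrary metric, one per age group.
raw = [-1.9, -1.2, -0.6, 0.0, 0.7, 1.1, 1.6]

# Anchor the second group at 40 and the last group at 160 (illustrative choice).
scaled = fix_origin_and_unit(raw, 1, 6, 40.0, 160.0)
print([round(v, 1) for v in scaled])
```

Because the transformation is linear, the shape of the growth curve (its acceleration or deceleration) is unaffected by the choice of anchors.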

Results:  The  Revised  Mental  Growth  Curve 


The  growth  curves  from  this  scaling  solution  are  shown  in  Figure  3.  The 
curve for boys is entirely plausible as a representation of growth. Unlike the
Thurstone-Ackerson  curve,  it  decelerates  from  the  earliest  age  until  adoles¬ 
cence.  The  final  two  points  suggest  the  possible  upward  inflection  of  a  slight 
adolescent  growth  spurt  in  mental  attainment.  At  age  14  the  curve  is  still  ris¬ 
ing,  and  presumably  would  go  higher  if  older  age  groups  were  included. 


Table 1

Mean Chronological Age (CA) and Sample Size
for Each Age Group

Age     CA Interval   Boys (N=342)     Girls (N=281)
Group   in Months       Mean      N      Mean      N

  1     24-36           30.9     17      31.8     10
  2     37-46           42.6     26      40.7     30
  3     47-56           50.5     29      51.4     25
  4     57-66           61.3     32      60.9     28
  5     67-76           71.4     35      72.2     22
  6     77-86           81.7     22      81.4     21
  7     87-96           91.5     29      91.6     18
  8     97-106         102.1     25     101.3     18
  9     107-116        111.6     24     111.5     23
 10     117-126        121.3     25     121.7     15
 11     127-136        132.1     19     130.6     16
 12     137-146        141.3     12     141.6     20
 13     147-156        151.3     15     151.3      8
 14     157-166        162.0     16     160.5     13
 15     167-178        170.9     16     173.1     14


The  curve  for  girls  is  less  satisfactory.  Initially,  it  resembles  the 
curve  for  boys;  but  from  years  6  through  11,  the  scale  values  for  girls  are  ir¬ 
regular  and  considerably  below  those  for  boys  of  the  same  age.  It  is  possible, 
of  course,  that  the  equating  of  boys  and  girls  at  160  months  is  unfair  to  the 
girls.  Perhaps  they  are  actually  10  or  20  points  higher  at  that  age.  If  so, 
the  points  in  the  range  70  to  130  months  would  be  more  comparable  in  boys  and 
girls. 

Such an adjustment, however, would make the girls' scores in the range 30
to 60 months high relative to those of the boys. Inasmuch as the percents correct for
girls on items in this range, or indeed in the upper range of 140 to 170 months,
were about the same as those for the boys, this interpretation does not seem
plausible (compare Tables A and B). The assumption that boys and girls have the
same  average  Binet  attainment  at  160  months  seems  reasonable  for  these  data. 

The  only  explanation  for  the  anomalous  result  for  girls  would  seem  to  be 
that  the  samples  for  the  two  sexes  were  not  comparable  in  some  age  groups.  Some 
bias  in  selection  of  subjects  or  in  administration  of  the  tests  must  have  oper¬ 
ated  against  girls  in  the  70-  to  130-month  range.  The  irregularity  of  the 
girls'  scale  values  in  this  range,  especially  the  discrepant  value  at  100 
months,  suggests  that  the  sample  of  girls  may  have  been  defective.  Regrettably, 
no  information  is  available  on  how  the  subjects  were  selected  or  how  the  tests 
were  administered. 
Figure 3

Proposed  Mental  Growth  Curve  Based  on  Binet 
Item  Data  Collated  by  Reckase  (1979) 


Advantages  of  the  Present  Scaling  Procedure 


Because  the  present  scaling  method  does  not  force  the  dispersions  of  the 
conditional  distributions  to  increase  with  age,  the  scale  is  not  stretched  to 
the  left  in  order  to  make  the  conditional  standard  deviations  small.  It  is  this 
stretching  of  the  scale  that  induces  the  initial  positive  acceleration  in  the 
Thurstone-Ackerson curve. When the dispersions are estimated without constraint,
the more plausible initial negative acceleration seen in Figure 3 is
obtained.

As  discussed  in  Bock  (1976)  and  Kolakowski  and  Bock  (in  press),  the  item 
parameters  estimated  in  the  scaling  solutions  can  also  be  used  to  assign  devel¬ 
opmental  age  scores  to  individual  subjects  by  the  method  of  maximum  likelihood 
(see  also  Birnbaum,  1968;  Samejima,  1969).  In  this  role  the  present  scaling 
solution  has  important  methodological  advantages.  On  the  developmental-age 
scale,  the  item  dispersions,  rather  than  increasing  as  Thurstone  had  assumed, 
are relatively homogeneous. A solution with all item standard deviations set to
their average value fit almost as well as the unconstrained solution. This implies
that the maximum likelihood estimates of developmental age of individual
subjects can be expressed with good accuracy as a function of the subject's
number-correct score. This follows from the close similarity of the 1-parameter
normal ogive model to the 1-parameter logistic model, in which number correct
is the sufficient statistic for the maximum likelihood estimate (Andersen,
1980).
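
The closeness of the normal ogive and logistic curves can be checked numerically. The sketch below compares the two on a grid, using the conventional scaling constant 1.702 for the logistic argument; that constant and the grid are assumptions brought in for illustration and are not stated in the text.

```python
from math import exp
from statistics import NormalDist


def logistic(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + exp(-x))


D = 1.702  # conventional scaling constant (assumption, not from the text)
phi = NormalDist().cdf

# Evaluate both curves from -5 to 5 in steps of 0.01.
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(phi(x) - logistic(D * x)) for x in grid)

print(f"largest absolute difference on the grid: {max_gap:.4f}")
```

With this constant, the two curves differ by less than .01 in probability everywhere on the grid, which is why a score that is sufficient under the logistic model is nearly so under the normal ogive model.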

Moreover,  when  the  within-age  group  standard  deviations  of  the  estimated 
developmental  age  scores  were  calculated  (Table  2),  they  were  also  relatively 
homogeneous.  This  means  that  analysis  of  variance  can  be  employed  to  investi¬ 
gate  relationships  between  developmental  age  and  other  age-structured  data  with¬ 
out  violating  the  assumption  of  homoscedasticity.  The  conventional  MA  scores 
for  the  Binet  do  not  have  this  property. 

Table 2

Developmental Age Means and
Standard Deviations for Children
in Successive Chronological Age (CA) Groups

Age     Nominal       Boys              Girls
Group   CA          Mean     SD       Mean     SD

  1      30         15.4   16.6       16.9   18.1
  2      40         40.0   19.2       46.0   12.6
  3      50         61.0   18.0       62.4   14.1
  4      60         76.2   17.3       76.9   22.1
  5      70         96.6   13.5       84.1   10.0
  6      80        110.9   10.1       92.1   10.2
  7      90        120.6   13.7      101.5   10.1
  8     100        125.8   13.2      105.0   11.5
  9     110        134.0   13.4      121.1   12.3
 10     120        143.3   10.7      134.0   16.7
 11     130        144.3   15.6      134.7   16.4
 12     140        148.3    9.7      143.9   14.9
 13     150        152.1   17.4      153.0   19.9
 14     160        160.0    8.2      160.0   19.1
 15     170        162.0   16.3      175.1   38.0

The developmental age scale may also have certain interpretational advantages
whenever changes within subjects, rather than normative comparisons with
age-mates, are at issue. Because the developmental age units are greater than
chronological  units  at  young  ages,  the  changes  in  scale  values  are  more  in  ac¬ 
cord  with  the  rate  of  behavioral  change  (i.e.,  the  surpassing  of  successive  de¬ 
velopmental  tasks)  than  with  changes  in  MA  scores.  Moreover,  the  growth  of  men¬ 
tal  attainment  in  terms  of  scaled  scores  will  parallel  closely  other  quantita¬ 
tive  indices  of  development,  such  as  stature.  Thus,  the  developmental  age 
scores  will  tend  to  show  simple  linear  relationships  with  direct  measures  of 
development.

The  present  scale,  however,  does  not  exhibit  increasing  standard  deviation 
with  age  and  thus  does  not  support  Thurstone's  definition  of  the  absolute  zero 
of  intelligence.  Provisionally,  at  least,  it  will  be  necessary  to  set  the  ori¬ 
gin  of  the  scale  on  some  more  arbitrary  basis. 


APPENDIX: Supplementary Tables
Latent Structure Estimation for
Assessing  Gain  in  Ability 


Lalitha  Sanathanan 
Argonne  National  Laboratory 


This  paper  deals  with  methods  for  assessing  the  progress  of  an  individual 
or  group  through  time.  The  methods  involve  (1)  measuring  the  gain  in  ability 
over a given period of time using a latent ability model, such as the Rasch
model, and (2) relating this gain to the average gain for similar individuals or
groups  over  the  same  length  of  time.  The  changes  in  ability  parameters  for  in¬ 
dividuals  and  for  groups  can  be  estimated  through  existing  methods  based  on  la¬ 
tent  trait  models.  However,  in  order  to  judge  whether  a  specific  individual  or 
group  has  progressed  satisfactorily,  it  is  necessary  to  compare  the  given  gain 
in  ability  with  gains  for  similar  individuals  or  groups. 

It  is  common  practice  to  report  test  scores  based  on  a  hierarchical  test 
system  such  as  the  Iowa  Tests  of  Basic  Skills  (ITBS)  in  the  form  of  grade  equiv¬ 
alent  scores.  The  grade  equivalent  of  any  given  test  score  is  approximately  the 
grade  whose  mean  is  the  given  score.  Its  principal  use  is  to  measure  the  prog¬ 
ress  of  an  individual  or  group  over  a  given  period  of  time.  The  increase  in 
grade  equivalent  scores,  referred  to  as  the  gain  score,  is  considered  a  measure 
of  this  type  of  longitudinal  progress.  In  spite  of  numerous  problems  in  the 
interpretation  of  grade  equivalent  scores,  the  gain  score  has  a  certain  appeal 
in  that  it  tries  to  express  progress  in  terms  of  gain  in  years.  This  paper  pro¬ 
vides  a  measure  of  longitudinal  progress  that  is  interpretable  in  terms  of  gain 
in  years  but  overcomes  the  objections  to  the  use  of  grade  equivalent  scores. 

The  measure  proposed  here  is  obtained  by  first  using  the  Rasch  model  of  latent 
ability to measure gain in ability on a non-normative scale, and then providing
a  normative  interpretation  for  this  gain. 

Let $\theta_1$ and $\theta_2$ be the values of the ability parameter for an individual at
times $t_1$ and $t_2$. Given an initial ability level of $\theta_{10}$ and a gain of $\theta_{20} - \theta_{10}$
over the period $t_2 - t_1$, assessment of this gain can be made on the basis of the
conditional distribution of $\theta_2$, given $\theta_1 = \theta_{10}$, for a norm group, such as a
national sample. In particular, the mean and standard deviation of this
conditional distribution enable the expression of an absolute gain as a percentile
gain,  which  in  turn  has  the  usual  interpretation. 

This  paper  provides  an  empirical  Bayes  procedure  for  computing  the  parame¬ 
ters  of  the  above  conditional  distribution  needed  for  this  type  of  judgment.  On 
the basis of the estimated parameters, the time it takes, on the average, for an
individual with initial ability $\theta_{10}$ to achieve a gain of $\theta_{20} - \theta_{10}$ can also be
computed, thus making possible the expression of progress on a chronological
scale.  Two  other  related  applications  of  the  empirical  Bayes  procedure  are  also 
discussed. 

The  Rasch  Model 


The Rasch model can be described as follows: Let $\theta$ denote a real-valued
parameter representing the ability of an individual, and let $p_j(\theta)$ be the
probability that an individual with parameter $\theta$ will correctly solve item $j$ from a
given pool of items. The Rasch model specifies that

$$p_j(\theta) = \frac{\exp\{\theta + \phi_j\}}{1 + \exp\{\theta + \phi_j\}}, \quad j = 1, \ldots, m, \qquad [1]$$

or, equivalently,

$$\operatorname{logit}\, p_j(\theta) = \theta + \phi_j, \quad j = 1, \ldots, m,$$

where $\phi_j$ is a real-valued parameter characterizing the difficulty of the item
and $m$ is the number of items.
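
As a small numerical illustration of Equation 1 and its logit form, the following Python sketch evaluates the response probability for given parameter values. The function names and the numerical values are illustrative assumptions; the additive form of ability and item parameter follows the equation as stated.

```python
from math import exp, log


def rasch_probability(theta, phi_j):
    """Probability of a correct response under Equation 1 (ability plus item parameter)."""
    return exp(theta + phi_j) / (1.0 + exp(theta + phi_j))


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return log(p / (1.0 - p))


theta = 1.0   # hypothetical ability
phi_j = 0.5   # hypothetical item parameter

p = rasch_probability(theta, phi_j)
print(p, logit(p))
```

The logit of the computed probability recovers the sum of the two parameters, which is the equivalence stated in the text.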

Consider a group of individuals with ability parameters $\theta_i$ whose responses
to $m$ items are observed. Under the assumption that individuals respond independently
of one another and that for the same individual, responses to different
items are mutually independent, maximum likelihood or other estimates of the
$\theta_i$'s and $\phi_j$'s can be obtained (for details see Anderson, 1970; Wright &
Panchapakesan, 1969). Assume that the raw scores for an individual at two points in
time, $t_1$ and $t_2$, are based on two different tests, such as those corresponding
to different hierarchical levels of a test system. Assuming that the items on
the two tests are calibrated and that estimates of the item parameters are
available, the raw scores $x_1$ and $x_2$ would be used separately on the two tests to
estimate the abilities $\theta_1$ and $\theta_2$ for the individual at $t_1$ and $t_2$, respectively.
There would thus be an estimate of the gain in ability $\theta_2 - \theta_1$ for the individual
over the period $t_1$ to $t_2$. This measure, however, has very little meaning
unless it is given a normative interpretation. It does not, for instance,
denote whether a specific individual has progressed satisfactorily.
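
The estimation of the two abilities from raw scores, with item parameters treated as known, can be sketched as follows. This is a minimal illustration using Newton-Raphson iteration on the Rasch likelihood equation (raw score equals the sum of the item probabilities); the function names, item parameters, and raw scores are hypothetical and are not taken from the paper.

```python
from math import exp


def expected_score(theta, item_params):
    """Sum of Rasch probabilities p_j(theta) over the items of one test."""
    return sum(exp(theta + phi) / (1.0 + exp(theta + phi)) for phi in item_params)


def estimate_ability(raw_score, item_params, tol=1e-10, max_iter=100):
    """Maximum likelihood ability estimate for a raw score, item parameters known."""
    m = len(item_params)
    if not 0 < raw_score < m:
        raise ValueError("no finite estimate exists for zero or perfect scores")
    theta = 0.0
    for _ in range(max_iter):
        probs = [exp(theta + phi) / (1.0 + exp(theta + phi)) for phi in item_params]
        gradient = raw_score - sum(probs)                # likelihood equation residual
        information = sum(p * (1.0 - p) for p in probs)  # test information at theta
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta


# Hypothetical calibrated item parameters for two test levels.
level_1_items = [1.0, 0.5, 0.0, -0.5, -1.0]
level_2_items = [0.0, -0.5, -1.0, -1.5, -2.0]

theta_1 = estimate_ability(2, level_1_items)  # raw score at the first time point
theta_2 = estimate_ability(3, level_2_items)  # raw score at the second time point
print(theta_1, theta_2, theta_2 - theta_1)
```

Because both tests are calibrated on the same ability scale, the difference of the two estimates is a gain on that scale even though the raw scores come from different items.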

In order to make a judgment of this nature, it is necessary to compare $\theta_2 - \theta_1$
for the given individual with gains for similar individuals. The conditional
distribution of $\theta_2$, given $\theta_1$, for a norm group provides a useful basis for the
above comparison. The mean and standard deviation of this conditional distribution
are relevant measures by which the gain $\theta_2 - \theta_1$ in an individual's ability
can be judged. This type of comparison involving conditional averages is more
appropriate than the one based on grade equivalent scores, since the former
takes into account the fact that gain in ability itself is dependent on initial
ability level, whereas for the latter comparison, averaging is done over all
individuals  in  certain  norm  groups  without  regard  to  their  abilities.  For  this 
reason  gain  expressed  in  terms  of  grade  equivalent  scores  is  likely  to  be  in¬ 
flated  if  an  individual  with  a  high  initial  ability  is  being  considered,  the 
opposite  being  true  in  the  case  of  low-ability  individuals.  Such  distortions 
are  avoided  by  the  proposed  method  based  on  conditional  averages. 
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An  Empirical  Bayes  Model  for  Assessing  Gain  in  Ability 

The need for estimating the mean and standard deviation of the conditional
distribution of $\theta_2$ (ability at time $t_2$), given $\theta_1$ (ability at time
$t_1$), for a specified group has been established in the previous section. In
this section a suitable model and a method for obtaining these estimates are
outlined.

The  Model 


Consider, for instance, a group of individuals whose raw scores, based on
different levels of a test, are available at two different times, $t_1$ and $t_2$.
Let the raw scores for individual $i$ at times $t_1$ and $t_2$ be denoted by $r_{i1}$
and $r_{i2}$, respectively. It is assumed that at each time point the raw scores
are adequately described by a Rasch model and that estimates of all the item
parameters are available. For the present purpose the item parameters will be
treated as if they are known. The tests are not required to be the same for all
individuals or to be the same at times $t_1$ and $t_2$ for the same individual.

Each individual $i$ can be characterized by $(\theta_{i1}, r_{i1}, S_{i1})$ and
$(\theta_{i2}, r_{i2}, S_{i2})$, where $\theta_{i1}$ and $\theta_{i2}$ are the individual's
latent abilities at times $t_1$ and $t_2$, respectively; $S_{i1}$ and $S_{i2}$ are the
sets of item parameters relevant to the two tests taken by the individual; and
$r_{i1}$ and $r_{i2}$ are the raw scores defined earlier. It can be further assumed
that the sample under consideration is drawn from a population of individuals
whose abilities at times $t_1$ and $t_2$ follow a bivariate normal distribution with
means $\mu_1$ and $\mu_2$, variances $\sigma_1^2$ and $\sigma_2^2$, and correlation
coefficient $\rho$. This type of longitudinal model has been used by Andersen
(1979) in another context.


Representing the latent abilities in this population at times $t_1$ and $t_2$ by
the generic variables $\theta_1$ and $\theta_2$, the joint distribution of $\theta_1$ and
$\theta_2$ may then be specified to be bivariate normal with density denoted by
$\phi(\theta_1, \theta_2)$. The density $\phi(\theta_1, \theta_2)$ resembles a Bayesian prior
density. However, an empirical Bayes approach is followed here in that the
parameters of the prior density are estimated from the sample. The conditional
distribution of $\theta_2$, given $\theta_1$, is univariate normal with mean and
variance

$$
E(\theta_2 \mid \theta_1) = \mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}\,(\theta_1 - \mu_1),
\qquad
\operatorname{Var}(\theta_2 \mid \theta_1) = \sigma_2^2\,(1 - \rho^2).
\tag{2}
$$

In order to estimate $E(\theta_2 \mid \theta_1)$ and $\operatorname{Var}(\theta_2 \mid \theta_1)$,
it is thus sufficient to estimate the parameters $\mu_1$, $\mu_2$, $\sigma_1$,
$\sigma_2$, and $\rho$ of the bivariate density $\phi(\theta_1, \theta_2)$.

In this problem $\theta_1$ and $\theta_2$ are latent, or unobservable, variables whose
characteristics (namely, $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$) are to be
estimated. The estimation of this latent structure must be done on the basis of
indirect observations represented by the responses of the individuals in the
given sample to items on different tests at $t_1$ and $t_2$. A method for
estimating latent structure in a similar situation involving a univariate
latent ability distribution has been provided by Sanathanan and Blumenthal
(1978). An extension of this method, which is discussed in the following
section, gives the required estimates for the problem considered here.

Once estimates of $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$ are obtained, a
specific individual's progress over the period from $t_1$ to $t_2$ can be judged as
follows: Let $\theta_{10}$ and $\theta_{20}$ be the $\theta_1$ and $\theta_2$ values,
respectively, for a given individual. Compute $E(\theta_2 \mid \theta_{10})$ and
$\operatorname{Var}(\theta_2 \mid \theta_{10})$, and use them to express the absolute gain in
ability for the individual as a percentile gain, which in turn has the usual
interpretation.
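
The percentile computation described above can be sketched as follows. This is
an illustration under the stated bivariate normal assumption, not the authors'
program; the function names are hypothetical, and the parameter values in the
example are the estimates reported in the paper's numerical illustration.

```python
import math


def conditional_mean_sd(theta1, mu1, mu2, sigma1, sigma2, rho):
    """Mean and standard deviation of theta2 given theta1 (Equation 2)."""
    mean = mu2 + rho * (sigma2 / sigma1) * (theta1 - mu1)
    sd = sigma2 * math.sqrt(1.0 - rho ** 2)
    return mean, sd


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gain_percentile(theta10, theta20, mu1, mu2, sigma1, sigma2, rho):
    """Percentile of theta20 among individuals who started at theta10."""
    mean, sd = conditional_mean_sd(theta10, mu1, mu2, sigma1, sigma2, rho)
    return 100.0 * normal_cdf((theta20 - mean) / sd)


# Reported estimates: mu1 = -.32, sigma1 = .839, mu2 = .14, sigma2 = 1.007, rho = .6
params = dict(mu1=-0.32, mu2=0.14, sigma1=0.839, sigma2=1.007, rho=0.6)
print(gain_percentile(theta10=-0.32, theta20=0.14, **params))
```

An individual who starts at the group mean and ends at the conditional mean
falls at the 50th percentile; a larger final ability gives a higher percentile.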

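The gain $\theta_{20} - \theta_{10}$ can also be interpreted in terms of gain in years
as follows: Given an individual with ability $\theta_{10}$, the expected gain for
this individual over the period $t_1$ to $t_2$ is $E(\theta_2 \mid \theta_{10}) - \theta_{10}$.
Let $\theta_3$ be the ability of an individual at time $t_3$, where
$t_3 - t_2 = t_2 - t_1$. The expected gain for an individual with ability
$\theta_{10}$ over the period $t_1$ to $t_3$ can be computed as follows:

$$
\begin{aligned}
E(\theta_3 \mid \theta_{10})
&= E_{\theta_2 \mid \theta_{10}}\left[E(\theta_3 \mid \theta_{10}, \theta_2)\right] \\
&= E_{\theta_2 \mid \theta_{10}}\left[E(\theta_3 \mid \theta_2)\right] \\
&= \mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}\left[E(\theta_2 \mid \theta_{10}) - \mu_1\right] \\
&= E\left[\theta_2 \mid \theta_1 = E(\theta_2 \mid \theta_{10})\right]
\end{aligned}
\tag{3}
$$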

Thus, for an individual with initial ability $\theta_1$, $E(\theta_2 \mid \theta_1)$, and
hence the expected gain for the period $t_2 - t_1$ or any multiple thereof, can be
computed. The expected gains can then be plotted against the corresponding time
periods. Given that an individual with ability $\theta_{10}$ has gained
$\theta_{20} - \theta_{10}$ in ability, initial ability $\theta_{10}$ can be looked up in
the expected gain chart and the time period corresponding to an expected gain
of $\theta_{20} - \theta_{10}$ can be determined by interpolation. This time period can
be interpreted as the gain in time for the individual. Depending on whether
this gain is less or greater than $t_2 - t_1$, the individual can be considered as
below or above average in performance.
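
The expected gain chart and the interpolation step can be sketched in a few
lines. The sketch assumes, as Equation 3 implicitly does, that the same
one-period regression applies to every successive period; time is measured in
units of $t_2 - t_1$, and the function names and the default of five periods
are illustrative choices rather than part of the paper.

```python
def expected_abilities(theta10, mu1, mu2, sigma1, sigma2, rho, n_periods):
    """Expected ability after 0, 1, ..., n_periods periods (iterating Equation 3)."""
    slope = rho * sigma2 / sigma1
    values = [theta10]
    for _ in range(n_periods):
        values.append(mu2 + slope * (values[-1] - mu1))
    return values


def gain_in_periods(theta10, theta20, mu1, mu2, sigma1, sigma2, rho, n_periods=5):
    """Time (in periods) at which the expected gain equals the observed gain.

    Uses linear interpolation between chart points; returns None when the
    observed gain lies outside the range covered by the chart.
    """
    abilities = expected_abilities(theta10, mu1, mu2, sigma1, sigma2, rho, n_periods)
    gains = [a - theta10 for a in abilities]
    observed = theta20 - theta10
    for n in range(1, len(gains)):
        lo, hi = gains[n - 1], gains[n]
        if hi != lo and min(lo, hi) <= observed <= max(lo, hi):
            return (n - 1) + (observed - lo) / (hi - lo)
    return None


# Estimates reported in the numerical illustration of the paper.
est = dict(mu1=-0.32, mu2=0.14, sigma1=0.839, sigma2=1.007, rho=0.6)
print(gain_in_periods(theta10=-0.32, theta20=0.14, **est))
```

A returned value below 1 corresponds to below-average progress over the
period, and a value above 1 to above-average progress.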

Latent  Structure  Estimation 


This section focuses on the estimation of the parameters $\mu_1$, $\mu_2$,
$\sigma_1$, $\sigma_2$, and $\rho$ of the bivariate density $\phi(\theta_1, \theta_2)$. As
pointed out above, these estimates provide the necessary information for
assessing the gain $\theta_{20} - \theta_{10}$ in an individual's ability. (This gain
itself is estimated on the basis of the individual's responses to items on two
different tests and an assumption of the Rasch model.)

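Let the test responses for an individual be represented by vectors $V_1$ and
$V_2$ corresponding to the tests at $t_1$ and $t_2$, respectively. Let the $j$th
component of $V_k$ be 1 if the $j$th item on the $k$th test is solved correctly,
and 0 otherwise.

The response vectors can be thought of as being generated by a sequence of
independent, identically distributed latent random vectors
$(\theta_{i1}, \theta_{i2})$, each with a bivariate normal distribution with density
$\phi(\theta_1, \theta_2)$. Since estimating the parameters of $\phi(\theta_1, \theta_2)$
is of interest, the estimation would ideally be based on the pairs
$(\theta_{i1}, \theta_{i2})$. In that case, the maximum likelihood method would yield
the following estimates, where $N$ denotes the number of individuals in the
sample:

$$
\hat{\mu}_k = \frac{1}{N}\sum_{i=1}^{N}\theta_{ik}, \qquad
\hat{\sigma}_k^2 = \frac{1}{N}\sum_{i=1}^{N}\theta_{ik}^2 - \hat{\mu}_k^2, \quad k = 1, 2, \qquad
\hat{\rho} = \frac{\frac{1}{N}\sum_{i=1}^{N}\theta_{i1}\theta_{i2} - \hat{\mu}_1\hat{\mu}_2}{\hat{\sigma}_1\hat{\sigma}_2}.
\tag{4}
$$
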
However, since the $(\theta_{i1}, \theta_{i2})$'s are not directly observable, the
indirect observations, namely the response vectors, must be relied upon to make
the appropriate inferences; and it is plausible to substitute
$E(\theta_{ik} \mid V_{i1}, V_{i2})$ for $\theta_{ik}$, $E(\theta_{ik}^2 \mid V_{i1}, V_{i2})$ for
$\theta_{ik}^2$, and $E(\theta_{i1}\theta_{i2} \mid V_{i1}, V_{i2})$ for $\theta_{i1}\theta_{i2}$ in
Equation 4. This is the approach followed here in estimating $\mu_1$, $\mu_2$,
$\sigma_1$, $\sigma_2$, and $\rho$. The approach is based on the missing information
principle (MIP) formulated by Orchard and Woodbury (1972) and yields the maximum
likelihood estimates of the parameters in question. The rationale for the MIP
approach is provided here in an intuitive sense. A rigorous explanation is
provided by Sanathanan and Blumenthal (1978), on the basis of which it is
evident that an application of MIP in this situation does lead to maximum
likelihood estimates.

The conditional expectations, such as $E(\theta_{ik} \mid V_{i1}, V_{i2})$, which are to
be substituted for the corresponding latent variables in Equation 4 depend on
the values of the parameters $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$, which
are themselves unknown and are to be estimated. The MIP approach requires that
the values of these parameters and those satisfying Equation 4 be the same.
This equality can be achieved through the following iterative procedure,
referred to as the EM algorithm by Dempster, Laird, and Rubin (1977), who also
show the convergence of this type of algorithm in a much more general setting.

Starting with trial values for $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$, cycle
through the E- and M-steps given below until convergence is attained.

E-step: Compute the conditional expectations, such as
        $E(\theta_{ik} \mid V_{i1}, V_{i2})$, using the current values of the
        parameters $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$.

M-step: Revise the parameter values by using Equation 4 and the conditional
        expectations from the E-step in place of the latent variables
        appearing in Equation 4.

Let $g_i(\theta_1, \theta_2)$ be the density of $(\theta_{i1}, \theta_{i2})$, conditional on
the response vectors $(V_{i1}, V_{i2})$. Then $g_i(\theta_1, \theta_2)$ is given by

$$
g_i(\theta_1, \theta_2) =
\frac{\phi(\theta_1, \theta_2)\,\prod_{k=1}^{2}\prod_{j=1}^{m_k}
      \left[p_{jk}(\theta_k)\right]^{x_{ijk}}\left[1 - p_{jk}(\theta_k)\right]^{1 - x_{ijk}}}
     {\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}
      \phi(\theta_1, \theta_2)\,\prod_{k=1}^{2}\prod_{j=1}^{m_k}
      \left[p_{jk}(\theta_k)\right]^{x_{ijk}}\left[1 - p_{jk}(\theta_k)\right]^{1 - x_{ijk}}
      \,d\theta_1\,d\theta_2}
\tag{5}
$$

where

    $\phi(\theta_1, \theta_2)$ is the bivariate normal density,
    $k$ is the test number,
    $m_k$ is the number of items on the $k$th test,
    $p_{jk}(\theta)$ is given by Equation 1, and
    $x_{ijk}$ is 1 if the $j$th item on the $k$th test is answered correctly
    by the $i$th individual, and is 0 otherwise.

In addition,

$$
\begin{aligned}
E(\theta_{ik} \mid V_{i1}, V_{i2})
  &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \theta_k\, g_i(\theta_1, \theta_2)\,d\theta_1\,d\theta_2, \quad k = 1, 2, \\
E(\theta_{i1}\theta_{i2} \mid V_{i1}, V_{i2})
  &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \theta_1\theta_2\, g_i(\theta_1, \theta_2)\,d\theta_1\,d\theta_2, \\
\text{and}\quad E(\theta_{ik}^2 \mid V_{i1}, V_{i2})
  &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \theta_k^2\, g_i(\theta_1, \theta_2)\,d\theta_1\,d\theta_2, \quad k = 1, 2.
\end{aligned}
\tag{6}
$$

If each of the two tests at times $t_1$ and $t_2$ is the same for all
individuals, then $g_i(\theta_1, \theta_2)$ is the same for all individuals with the
same raw scores $(r_{i1}, r_{i2})$. It is then enough to consider the conditional
expectations in Equation 6 for all possible pairs of raw scores. On the other
hand, if the individuals are administered different tests at any particular
time, then the conditional expectations in Equation 6 must be evaluated
separately for each individual. Here again the individual's raw scores on the
two tests are sufficient for evaluating the conditional expectations in
Equation 6. Basically, the expressions in Equation 6 can be rewritten by noting
that

$$
g_i(\theta_1, \theta_2) \propto
\frac{\phi(\theta_1, \theta_2)\,\exp(\theta_1 r_{i1} + \theta_2 r_{i2})}
     {\prod_{k=1}^{2}\prod_{j=1}^{m_k}\left(1 + \exp\{\theta_k + \phi_{jk}\}\right)}
\tag{7}
$$

where $\phi_{jk}$ is the item parameter for the $j$th item on the $k$th test taken
by the $i$th individual and is assumed to be known (or estimated separately).


Computing  the  expectations  in  Equation  6  calls  for  numerical  integration, 
which  is  done  by  using  the  FORTRAN  version  of  CACM  Algorithm  145  called  ASIMPS. 
This is the same program that was used for the computations described by
Sanathanan and Blumenthal (1978).
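
To make the integration step concrete, the sketch below approximates the
conditional expectations of Equation 6 for one pair of raw scores, using the
kernel of Equation 7. It substitutes a plain fixed-grid sum for the adaptive
Simpson routine (ASIMPS) named above, so it is not the authors' program; the
grid width, grid size, and all function names are assumptions made for
illustration. The item parameters are those of Table 2.

```python
import math

# Item parameters (phi) from Table 2; logit p = theta + phi as in Equation 7.
ITEMS_TEST1 = [-0.4033, 0.4476, 0.4791, 0.6743]
ITEMS_TEST2 = [-1.5921, 0.3064, -1.0051, 1.0932]


def log_kernel(t1, t2, r1, r2, params):
    """Log of the right-hand side of Equation 7, up to an additive constant."""
    mu1, mu2, s1, s2, rho = params
    z1 = (t1 - mu1) / s1
    z2 = (t2 - mu2) / s2
    log_prior = -(z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (2.0 * (1.0 - rho * rho))
    log_lik = t1 * r1 + t2 * r2
    log_lik -= sum(math.log1p(math.exp(t1 + phi)) for phi in ITEMS_TEST1)
    log_lik -= sum(math.log1p(math.exp(t2 + phi)) for phi in ITEMS_TEST2)
    return log_prior + log_lik


def posterior_moments(r1, r2, params, half_width=6.0, n=61):
    """Approximate the expectations in Equation 6 for raw scores (r1, r2).

    The grid spans half_width prior standard deviations around each mean;
    the constant cell area cancels in the ratios, so it is omitted.
    """
    mu1, mu2, s1, s2, _ = params
    step = 2.0 * half_width / (n - 1)
    grid1 = [mu1 + s1 * (-half_width + step * i) for i in range(n)]
    grid2 = [mu2 + s2 * (-half_width + step * i) for i in range(n)]
    total = m1 = m2 = m11 = m22 = m12 = 0.0
    for t1 in grid1:
        for t2 in grid2:
            w = math.exp(log_kernel(t1, t2, r1, r2, params))
            total += w
            m1 += w * t1
            m2 += w * t2
            m11 += w * t1 * t1
            m22 += w * t2 * t2
            m12 += w * t1 * t2
    return {"E1": m1 / total, "E2": m2 / total, "E11": m11 / total,
            "E22": m22 / total, "E12": m12 / total}


# Parameter order: (mu1, mu2, sigma1, sigma2, rho), here the reported estimates.
print(posterior_moments(2, 3, (-0.32, 0.14, 0.839, 1.007, 0.6)))
```

Because only the raw scores enter the kernel, the same call serves every
individual who took the same two tests and obtained the same pair of scores.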


A remark concerning the accuracy of estimation is in order. As in regression
analysis, for the estimation of $E(\theta_2 \mid \theta_{10})$ the best accuracy is
obtained when $\theta_{10}$ is the same as or close to the mean ability $\mu_1$ of the
group used for estimating the parameters of $\phi(\theta_1, \theta_2)$. For adequate
estimation, there must therefore be several samples whose mean abilities are
spread over the range of interest. For a given initial ability $\theta_{10}$,
$E(\theta_2 \mid \theta_{10})$ would then be computed using estimates of parameters
based on the sample whose mean ability is closest to $\theta_{10}$.

Numerical  Illustration 


The procedure that has been described for estimating the parameters $\mu_1$,
$\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$ of the bivariate density
$\phi(\theta_1, \theta_2)$ is illustrated using the following synthetic data. Table 1
represents the responses of 1,000 individuals to tests at two different times.
Each test consists of four items whose $\phi_j$ estimates are given in Table 2.

The maximum likelihood solution is obtained by an iterative procedure
involving all five parameters simultaneously. A computational shortcut is
achieved here by obtaining estimates for $\mu_1$ and $\sigma_1$, and $\mu_2$ and
$\sigma_2$, separately, based on the respective marginal distributions. Although
this procedure is not strictly valid by the maximum likelihood criterion, it is
an acceptable compromise between computational efficiency and theoretical
rigor. The computational procedures used in obtaining these estimates are as
follows: Trial values for $\mu_1$, $\sigma_1$ and $\mu_2$, $\sigma_2$ were chosen as .1
and 1.0, respectively. In each case, the initial values for $\mu$ and $\sigma$, the
relevant $\phi_j$ values, and the relevant marginal raw score frequencies were
entered into a computer program for carrying out the E- and M-steps outlined
above. This part of the computation involves only the respective conditional
means and variances (and not covariances) and marginal distributions of
$\theta_1$ and $\theta_2$ separately. After five iterations the final estimates of
$\mu_1$ and $\sigma_1$ obtained were -.32 and .839, respectively. The estimates for
$\mu_2$ and $\sigma_2$ were obtained after two iterations as .14 and 1.007,
respectively.


Table 1
Bivariate Frequency Distribution of Raw Scores

Time t1                Time t2 Raw Score               Row
Raw Score         0      1      2      3      4      Margin
    0            32     25     22     18      2         99
    1            53     69     42     40      8        212
    2            33     75    104     95      7        314
    3            25     29    139    146      9        348
    4             1      3      7      5     11         27
Column Margin   144    201    314    304     37



Table 2
$\phi_j$ Estimates of Test Items

                Test
Item         1          2
  1       -.4033    -1.5921
  2        .4476      .3064
  3        .4791    -1.0051
  4        .6743     1.0932

For estimating $\rho$ the values of $\mu_1$, $\sigma_1$ and $\mu_2$, $\sigma_2$ were treated
as if known, and their estimates were inserted into the expression for
$g(\theta_1, \theta_2)$, the generic density of $(\theta_1, \theta_2)$ conditional on a
given pair of raw scores. A trial value of $\rho = .8$ was used for evaluating the
expectation of $\theta_1\theta_2$ conditional on various combinations of raw scores,
constituting the E-step. The average of these conditional expectations was, in
turn, used to revise the value of $\rho$, as required by the M-step. After two
iterations, the $\rho$ estimate obtained was .6.
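
The E- and M-steps can be put together in a short program run on the
frequencies of Table 1 and the item parameters of Table 2. The sketch below is
an illustration only: it updates all five parameters jointly (not the marginal
shortcut described above) and replaces the adaptive Simpson integration with a
fixed grid on [-6, 6], so its output need not reproduce the reported estimates.
The grid, the number of iterations, and all names are assumptions.

```python
import math

# Table 1: rows are raw scores at t1 (0..4), columns are raw scores at t2 (0..4).
TABLE1 = [
    [32, 25, 22, 18, 2],
    [53, 69, 42, 40, 8],
    [33, 75, 104, 95, 7],
    [25, 29, 139, 146, 9],
    [1, 3, 7, 5, 11],
]
# Table 2 item parameters; logit p = theta + phi as in Equation 7.
ITEMS1 = [-0.4033, 0.4476, 0.4791, 0.6743]
ITEMS2 = [-1.5921, 0.3064, -1.0051, 1.0932]
GRID = [-6.0 + 0.3 * i for i in range(41)]  # fixed integration grid


def lik_table(items):
    """exp(theta * r) / prod(1 + exp(theta + phi)) for each raw score r and grid point."""
    return [[math.exp(t * r - sum(math.log1p(math.exp(t + phi)) for phi in items))
             for t in GRID] for r in range(len(items) + 1)]


LIK1, LIK2 = lik_table(ITEMS1), lik_table(ITEMS2)


def em_step(params):
    """One E-step (Equations 6 and 7) followed by one M-step (Equation 4)."""
    mu1, mu2, s1, s2, rho = params
    c = 2.0 * (1.0 - rho * rho)
    prior = [[math.exp(-(((a - mu1) / s1) ** 2
                         - 2.0 * rho * ((a - mu1) / s1) * ((b - mu2) / s2)
                         + ((b - mu2) / s2) ** 2) / c)
              for b in GRID] for a in GRID]
    n = 0
    e1 = e2 = e11 = e22 = e12 = 0.0
    for r1 in range(5):
        for r2 in range(5):
            tot = m1 = m2 = m11 = m22 = m12 = 0.0
            for i, a in enumerate(GRID):
                for j, b in enumerate(GRID):
                    w = prior[i][j] * LIK1[r1][i] * LIK2[r2][j]
                    tot += w
                    m1 += w * a
                    m2 += w * b
                    m11 += w * a * a
                    m22 += w * b * b
                    m12 += w * a * b
            f = TABLE1[r1][r2]  # number of individuals with this score pair
            n += f
            e1 += f * m1 / tot
            e2 += f * m2 / tot
            e11 += f * m11 / tot
            e22 += f * m22 / tot
            e12 += f * m12 / tot
    mu1, mu2 = e1 / n, e2 / n
    s1 = math.sqrt(e11 / n - mu1 * mu1)
    s2 = math.sqrt(e22 / n - mu2 * mu2)
    rho = (e12 / n - mu1 * mu2) / (s1 * s2)
    return (mu1, mu2, s1, s2, rho)


def run_em(params, n_iter):
    """Repeat the EM step a fixed number of times from the given trial values."""
    for _ in range(n_iter):
        params = em_step(params)
    return params


# Trial values as in the text: means .1, standard deviations 1.0, rho .8.
estimates = run_em((0.1, 0.1, 1.0, 1.0, 0.8), n_iter=10)
print([round(v, 3) for v in estimates])
```

A fixed iteration count keeps the sketch short; a practical version would stop
when successive parameter values differ by less than a chosen tolerance.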


Related  Applications 


The empirical Bayes procedure used in assessing longitudinal progress can
also be applied to the following related problems:

Consider the problem of evaluating the effectiveness of a new program or a
new instructional method. There are usually an experimental group and a control
group that are to be compared on the basis of "before" and "after" test scores.
Since gain in ability is, to some extent, dependent on initial ability level,
for a meaningful comparison differences in the initial ability levels of the
groups must be considered. This can be accomplished as follows: Estimate
$E(\theta_2 \mid \theta_1)$ for the groups separately and average the resulting
functions over $\theta_1$, using a common marginal distribution for $\theta_1$ (this
could, for instance, be the ability distribution of some specified norm group).
The averages thus obtained would be free of biases resulting from differences
in initial ability levels and hence are comparable.
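
Because the regression in Equation 2 is linear in $\theta_1$, the averaging step
has a simple closed form. As a worked sketch, with the group subscript $g$ and
a common marginal density $h(\theta_1)$ with mean $m$ introduced here purely for
illustration:

$$
\bar{E}_g = \int_{-\infty}^{\infty} E_g(\theta_2 \mid \theta_1)\, h(\theta_1)\, d\theta_1
          = \mu_{2g} + \rho_g\,\frac{\sigma_{2g}}{\sigma_{1g}}\,(m - \mu_{1g}),
\qquad g \in \{\text{experimental}, \text{control}\}.
$$

Under these assumptions the comparison reduces to the difference between the
two values of $\bar{E}_g$, both evaluated at the same $m$.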

Another problem to which the empirical Bayes procedure presented in this
paper is applicable is that of estimating the correlation coefficient between
two tests intended to measure the same or possibly different latent traits. To
do this, let the latent traits to be measured by the two tests correspond to
$\theta_1$ and $\theta_2$ in the empirical Bayes model and follow the procedure
described for computing the required correlation coefficient. This approach
circumvents the difficulties encountered in the usual approach, where $\theta_1$
and $\theta_2$ are first estimated for each individual in a given sample and the
resulting estimates are used for computing the correlation coefficient.

When I first became acquainted with computerized adaptive testing, I
considered it to be of little practical importance, for which psychologist is
equipped with a sufficient number of computer terminals and has access to a
time-sharing computer system? Therefore, I predicted only a few applications of
adaptive testing in the near future. The actual development has proved me
wrong. The advent of microprocessor techniques in particular is making adaptive
testing practical, and adaptive testing procedures are being rapidly developed
along with the spread of their applications in large-scale testing projects.
Moreover, there have been advances in the theory underlying adaptive testing,
e.g., the Bayesian approach. An effort to catch up with this progress will have
to be made in the European countries, where, however, the number of testees is
usually much smaller than in the U.S., rendering the economic aspects of
adaptive testing somewhat different.

Although the theoretical advantages of adaptive testing cannot be disputed
in principle, caution should be exercised against being overenthusiastic about
adaptive testing, since results from empirical applications might turn out
somewhat less favorable than in theory.

Adaptive testing has become possible only through the various strong
true-score theories, which (in contrast to the tautological assumptions of
classical test theory) attempt to force the responses of subjects into the
corset of restrictive model assumptions; the bonus from these assumptions,
however, is that there is a basis for explaining observed behavior in terms of
certain item and person parameters, so that a subject's chances of solving any
additional item can be predicted from previous responses, and thus an
appropriate item can be chosen. The validity of this procedure, as well as that
of the test results, rests wholly on the validity of the model used, and the
model must be required to hold for each and every subject. No non-fitting
subjects, such as Lumsden's (1980) "lazy subject" who responds inadvertently to
an item, are allowed; no systematic differences between subjects or groups of
subjects with respect to the ROC curves are allowed, either. Hence, the ROC
curves must be the same for all subjects or, more practically, for all relevant
groups of subjects within the population of interest. There will have to be a
comparison of the results of item calibrations in subsets of subjects who
differ as much as possible in some relevant variables, such as age, sex,
socioeconomic status, education, and ability. Only if the ROC parameters come
out the same in all such subgroups will the model hold with sufficient accuracy
to allow adaptive testing.

The question arises, If such studies are undertaken, is there much hope for
attaining stable results? To be more concrete, are the same item parameters
really obtained, e.g., in groups of very bright and groups of rather dull
examinees? In view of the generally acknowledged difficulties of estimating the
guessing and discrimination parameters at all, it is doubtful that the
estimates of these parameters would show only sampling errors when estimated
from, say, groups differing radically in average latent ability. If I am
correct, however, the consequence is that the validity of adaptive testing
procedures based on the ROC parameters must be doubted.

Where does that leave us? Should we not resort to a model that is based on
the principle that item parameter estimates must be independent of the sample
of subjects, i.e., where the parameter estimates are "sample-free"? Of course,
no model can guarantee what the data will be like, but the model should have a
formal structure that in principle enables the estimation of the item
parameters independently of the ability distribution in the sample of subjects.
In other words, this leads directly to the Rasch model.

There are some important advantages of the Rasch model with respect to
adaptive testing that have not been discussed at this conference so far: By
putting a linear structure into the item parameters, yielding the so-called
linear logistic test model (LLTM), one can, at least in certain domains of
ability testing, explain the item difficulty in terms of more elementary
cognitive operations. This entails the possibility of defining large
unidimensional universes of test items where each item has a difficulty
parameter predicted from the logical structure of the item. The LLTM has been
applied, for example, to materials similar to the Raven Progressive Matrices,
but with items constructed systematically on the basis of a defined set of
cognitive operations. The universe of these items is, in principle, unlimited;
but in practice, of course, just a fairly large set of items is obtained. Such
items have been used by Fischer and Pendl (1977) for the purpose of a simple
adaptive testing strategy that can be applied without using a computer.

Besides the theoretical nicety of the LLTM, which explains item difficulty
on the basis of a psychological microtheory, and besides its applicability to
adaptive testing strategies, the LLTM permits an investigation of other types
of problems, which could be considered as further advances in latent trait
theory. It lends itself to analyzing the effects of context, item position, and
learning that occurs during test-taking; to predicting the asymptotic
difficulty of cognitive operations and/or of items after infinitely long
practice; and, generally speaking, to analyzing the effects of any kind of
experimental condition on the probability of a correct response. Furthermore,
these linear logistic models have been developed and tentatively applied to
polychotomous items, which yield more detailed information than the
dichotomously scored items. Also, the application of the polychotomous Rasch
model to projective test data, for example, has been seen to be quite
successful. Going beyond the LLTM- and LLRA-type models, a dynamic extension of
the Rasch model has been developed by Kempf (e.g., 1977; Kempf & Mach, 1975),
viewing test-taking behavior of the subject as a stochastic process; it takes
response-contingent "transfer effects" on the subject's ability into account.

In conclusion, I would like to briefly discuss the progress of latent trait
theory beyond the traditional domain of test theory. As was pointed out in
Fisher (1980), there is an attempt to apply latent trait theory to other fields
than traditional ability testing, e.g., to problems in applied and clinical
psychology. One major problem type is the detection and assessment of change;
that latent trait theory has been extended to multidimensional item sets is an
important step. Anyone dealing with measurement of change under the influence
of educational programs or therapeutic treatments will find that using
unidimensional tests for measuring change means leaving out many criteria
(items) that, according to the applied or clinical psychologist, are often the
most relevant ones. A homogeneous test is something beautiful for the
psychometrician, but it may be rather useless from the point of view of the
applied psychologist. Therefore, it is a major advance that latent trait models
can be adapted to multidimensional item sets, i.e., to such item sets as are
approved by our colleagues from the applied departments.

Latent trait models have also been devised for analyzing types of
observations that are quite different from those discussed at this conference,
e.g., for describing social interaction in groups. Scheiblechner (e.g., 1977)
has developed such models for qualitative observations and for frequency data.
These new approaches to certain problems in social psychology seem to be quite
promising. I believe, therefore, that we are at the beginning of an era of
psychometrics where latent trait theory will be greatly generalized so as to
become applicable to very different problems in experimental, social, and
applied psychology.

I am concerned that much of the data that is gathered may be seriously
impaired by students who do not cooperate, which was not a problem 20 or 30
years ago but is a serious problem in many cases today. If a student answers
half the items in a normal fashion and then answers the rest of the items at
random, this will create problems in the statistical analysis. If some of the
students in one group in an equating study respond in this manner and those in
the other group do not, the value of the study could be destroyed.

An  important  point  is  sometimes  ignored  in  the  consideration  of  adaptive 
testing:  Adaptive  testing  is  most  useful  when  it  is  necessary  to  measure  well 
at  both  extremes  of  the  ability  range.  It  is  not  at  all  useful  if  all  that  is 
needed  is  to  divide  a  group  of  people  into  those  who  will  be  accepted  and  those 
who  will  be  rejected. 

I would like to endorse Lumsden's suggestion that item parameters can be
estimated much better if an extra group of low ability, and perhaps an extra
group of high ability, is added to the group of subjects. If this is to be
done, however, it would be very difficult to use a Bayesian approach, since it
can no longer be assumed that ability is normally distributed.

I was interested in some of Yen's results. She studied the difference
between estimated $\theta$ (ability) on two parallel tests as a function of
ability level, comparing the Rasch estimates of ability with the estimates from
the 3-parameter model. On the surface, the results were rather startling. The
3-parameter model yields smaller differences than the Rasch model at high
ability levels; but the Rasch model yields smaller differences at low ability
levels. This gives the (mistaken) impression that if estimation of the ability
of high-ability level people is desired, the 3-parameter model should be used;
but if estimation of the ability of low-ability level people is desired, the
Rasch model should be used.

I would like to explain what I think is occurring here. The Rasch estimates
of ability are based on number-correct score. A person who answers 20% of the
items correctly has a standard error of measurement that is about the same as
if he/she had answered 80% correctly. In the case of the 3-parameter model,
where there is guessing, it is obvious that low-ability people guess
frequently, which introduces random error into their scores; so it is expected
that the standard error of measurement will be higher at low ability levels
than at high ability levels.

Since the Rasch estimator is based on number-correct score, there is no
reason for the score of a low-ability person to fluctuate wildly; thus, there
is a relatively small standard error under the Rasch model. Under the
3-parameter model, if it is desired to estimate the ability of low-level
people, the difficult items should not be scored but thrown away, since they
just add noise to the score. To go to an extreme, the 3-parameter ability
estimate for a low-ability person may be based on the person's responses to
just two or three items out of the entire test. Clearly, in this extreme case,
such an estimate is going to have a large standard error. Nevertheless, this is
the correct way to estimate ability if there is guessing. The 3-parameter model
should be used in spite of the fact that it gives this large standard error.

There is a problem correlating estimates of $\theta$. At least in conventional
testing, it is quite likely that some people will be found whose maximum
likelihood estimate of ability is at $-\infty$. (In tailored testing this will be
avoided if there are enough easy items in the pool.) If there is a finite
proportion of people with $\hat{\theta}$ of $-\infty$, it is obviously impossible to
compute means and variances and correlations of $\hat{\theta}$. I do not think
excluding these people is a solution; the results would depend on the vagaries
of the situation: on how many people are at -50, how many at -40, and so on.


For  most  purposes,  it  really  does  not  matter  very  much  whether  a  person's 
ability is estimated to be -6 or -20. If it did matter, clearly we should not
have given the person the test we did; we should have given him/her an easier
test that would allow the accurate determination of whether he/she is at -6 or
-20. That we did not give him/her such an easy test suggests that we do not
care  whether  he/she  is  at  -6  or  -20.  If  this  is  true,  then  it  is  clearly  wrong 
to  use  a  numerical  scale  that  attaches  much  importance  to  such  a  difference. 

If a Bayesian estimation procedure is used, estimates of -∞ will not be
obtained.  This  really  does  not  get  at  the  basic  problem,  however,  which  is  that 
differences  at  the  extremes  of  the  scale  are  not  very  important.  The  only  way 
to  eliminate  this  difficulty  is  to  transform  the  scale  and  to  use  numbers  that 
represent  faithfully  whatever  importance  is  attached  to  the  differences. 

One way to do this is to transform each θ̂ into an estimated number-correct
true  score,  which  is  a  monotonic  transformation.  The  number-correct  score  scale 
is  the  kind  of  scale  that  we  are  accustomed  to  using.  The  fact  that  we  often 
work  with  number-correct  scores  suggests  that  this  scale  reflects  the  kinds  of 
differences  considered  important. 

A θ̂ of -6 and a θ̂ of -20 will both transform to a true score very close to
zero. That takes care of the problem. Means and standard deviations can then
be  computed;  and  different  testing  procedures  or  different  teaching  procedures 
or  different  estimation  procedures,  or  whatever  it  is  we  need  to  compare,  can  be 
compared  on  this  scale. 
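
The transformation described above can be sketched as follows. This is a minimal illustration that assumes the estimated number-correct true score is the sum of the item response probabilities at the estimated ability. The item parameters, the constant D = 1.7, and the function names are hypothetical. The items are given c = 0 so that the floor of the scale is zero, as in the text; with guessing, the floor would instead be the sum of the c values.

```python
import math

D = 1.7  # conventional scaling constant (an assumption)

def p3pl(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, items):
    """Estimated number-correct true score: sum of item probabilities at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

# Hypothetical 21-item test, difficulties from -2 to 2, no guessing.
items = [(1.0, -2.0 + 0.2 * k, 0.0) for k in range(21)]

for theta in (-20.0, -6.0, 0.0, 2.0):
    print(f"theta = {theta:6.1f}  ->  true score = {true_score(theta, items):.4f}")
```

Because the transformation is monotonic, the ordering of examinees is preserved, while the very large gap between -6 and -20 on the ability scale shrinks to a negligible difference on the true-score scale.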

The  last  point  is  a  problem  on  which  I  am  currently  working,  which  I  think 
is  rather  important:  ways  to  correct  for  the  bias  in  various  quantities  that  are 
estimated by LOGIST. Bias is of particular concern when doing repeated equatings.
At Educational Testing Service, Form H is equated to Form G, Form I to
Form H, Form J to Form I, Form K to Form J, and so on. Sometimes there are 12
new forms a year. If there is a small bias in each of these equatings, due to
the  fact  that  the  parameter  estimates  are  biased,  the  bias  will  accumulate  over 
a  period  of  time  and  become  rather  serious. 


James  Lumsden 

University  of  Western  Australia 


"Trotsky no doubt said many foolish things. But one wise thing he said
was, 'Belief without action is death!' What do test theorists believe? How do
they act? Belief without action is death. Are we all, then, test theorists,
dead? Yes. And not even decent corpses enriching the earth in which we decompose.
We must learn to live."

My  confidence  in  the  truth  of  the  statement  above  (taken  from  a  sermon  in 
honor of Oscar Buros) has been shaken by events of the past few months, and particularly
of the past few days. The younger test theorists seem more sensitive
to problems and more willing to act than I had expected.

There are problems. The papers of this conference have consistently revealed
a crisis in adaptive testing. The expensive apparatus constructed by the
psychoarithmeticians has not delivered as promised. It has given, at best, mediocre
results and on too many occasions results that are odd; indeed, inconceivable
if the model even remotely holds.

Most  of  the  difficulties  are  with  the  3-parameter  model,  and  perhaps  the 
great  arithmeticians  will  solve  them.  However,  this  is  unlikely.  There  are 
strong theoretical grounds for the belief that there can be no satisfactory solution.
What can be done about it?

The  multiple-choice  item  can  be  abandoned  wherever  possible  and  completion- 
type  items  can  be  used.  There  is  already  available  a  useful  range  of  tests  that 
can be given in that form, for example, standard well-tested items from intelligence
tests: number span (forward, back, simultaneous, successive), number series,
letter series, mathematical problems, and code substitution. If a program
can be found to "normalize" spelling (or if we are prepared to include spelling
as part of the systematic variance), then synonyms, antonyms, and verbal analogies
can be added. This is no trivial list. And it would seem highly likely
that  imaginative  use  of  the  flexible  delivery  made  possible  by  computers  will 
greatly  enlarge  the  possibilities  for  completion  items. 

Abandonment  of  adaptive  testing  with  multiple-choice  items  would  thus  avoid 
the  necessity  for  precise  estimation  of  item  parameters.  Efficient  adaptive 
testing  is  only  possible  when  discrimination  over  a  relatively  wide  range  of 
ability  is  required  and  when  the  discriminatory  power  of  the  items  is  relatively 
high.  For  all  other  cases  conventional  testing  is  indicated.  How  should  the 
conventional  test  be  scored?  If  the  item  characteristic  curve  (ICC)  procedures 
are  preferred  and  the  uncertainties  of  estimation  in  the  3-parameter  model  can 
be  tolerated,  they  may  be  used. 

I cannot bring myself to call anything a true score, and I suggest that a better
name for it is the "estimated raw score." The raw score is a good estimator of
the estimated raw score, typically accounting for over 95% of the variance. For
most  purposes,  the  raw  score  will  do  everything  that  is  needed  without  any  need 
to  consider  very  seriously  the  item  parameters. 

A  more  powerful  alternative  to  adaptive  testing  is  sequential  testing, 
which  does  not  seem  to  have  been  seriously  treated.  On  the  basis  of  a  short 
routing test, subjects can be rejected, selected, or given further testing.
With appropriate tests it should be possible to better the performance of the
best  conventional  tests  and  to  match  that  of  good  adaptive  tests. 
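
The three-way routing decision mentioned above might be sketched as follows. This is only an illustration of the general idea: the cumulative-score rule, the cutoff values, and the function name are assumptions, not a procedure given in the discussion.

```python
def sequential_decision(stage_scores, cutoffs):
    """Apply reject / select / continue rules to a cumulative score.

    stage_scores: scores on successive short tests (hypothetical).
    cutoffs: one (reject_below, select_at) pair per stage, applied to the
             running total after that stage (hypothetical values).
    Returns the decision and the stage at which it was reached.
    """
    total = 0
    stage = 0
    for stage, (score, (reject_below, select_at)) in enumerate(
            zip(stage_scores, cutoffs), start=1):
        total += score
        if total < reject_below:
            return "reject", stage
        if total >= select_at:
            return "select", stage
    return "undecided", stage

cutoffs = [(4, 8), (9, 12)]  # hypothetical cut scores for a two-stage procedure
print(sequential_decision([2], cutoffs))     # low routing score
print(sequential_decision([9], cutoffs))     # high routing score
print(sequential_decision([5, 8], cutoffs))  # middling routing score, then more testing
```

Only examinees whose routing score falls between the two cutoffs receive further testing, which is where the saving in testing time would come from.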

No  one  has  spoken  at  this  conference  about  test  construction — about  proce¬ 
dures  for  forming  and  improving  item  banks.  This  should  be  a  matter  of  prime 
concern,  for  obviously  no  amount  of  arithmetic  is  going  to  overcome  the  problems 
of  a  badly  constructed  test.  My  preference  is  for  factor  analytic  procedures. 
These  may  be  used  in  some  cases  to  construct  a  strictly  unidimensional  test.  In 
others,  factor  analysis  may  be  deliberately  used  to  construct  a  heterogeneous 
test.  The  classical  item  analysis  procedures  may  operate  to  exclude  a  precious 
group  of  items  measuring  an  important  criterion-relevant  ability  that  is  not 
measured  by  the  great  majority  of  the  other  items.  Factor  analysis  gives  the 
choice  of  making  two  tests  or  a  single  heterogeneous  test. 

Careful  test  construction  with  completion  type  items  is  the  only  way  to 
achieve  a  fit  to  the  1-parameter  Rasch  model.  When  items  are  constructed  ac¬ 
cording  to  a  strict  specification  and  tested  by  factor  analysis,  then  it  can  be 
guaranteed  that  the  slopes  of  the  ICCs  will  be,  at  least,  highly  similar. 

Finally, let me suggest that the proper attitude for a test theorist, indeed
any theorist, is lighthearted, even playful. I notice that most test theorists
are solemn. Recall the Yerkes-Dodson Law. When problems are difficult,
grim determination is a disadvantage rather than a help. All theoretical advances
come from analogical thinking. One should try to develop a set of analogies
crammed with surplus meanings that free one from the empty mathematical
formulations.

I  recommend  that  you  all  start,  and  some  finish,  an  elementary  textbook 
that  sets  out  to  explain  ICC  theory  to  the  most  mathematically  inept  group,  say, 
clinical  psychologists  or  educators.  You  will  find  that  you  will  be  searching 
for  clarifying  examples  and  simple  analogies  to  make  the  message  comprehensible. 
The  most  important  spin-off  of  this  exercise  is  that  you  will  also  come  to  a 
deeper  understanding  and  intuitive  grasp  of  your  trade. 




David  J.  Weiss 
University  of  Minnesota 


One  of  the  concerns  that  I  have  heard  expressed  during  this  conference  has 
been  the  problem,  "Do  responses  of  real  people  fit  the  ICC  model?"  I  began  to 
be  concerned  about  this  problem  some  time  ago  (Weiss,  1973),  resulting  in  my 
independent  discovery  of  Mosier's  (1940,  1941)  Person  Characteristic  Curve 
(PCC).  To  investigate  the  idea  of  the  PCC  and  to  see  whether  it  could  be  used 
to  test  the  fit  of  people  to  the  3-parameter  item  characteristic  curve  (ICC) 
model,  151  students  in  an  introductory  psychology  course  at  the  University  of 
Minnesota  were  administered  216  five-option  multiple-choice  vocabulary  test 
items. The items were then split into subgroups by their difficulty (b) parameters,
and 9 strata were constructed in terms of difficulty with 24 items in each
stratum.  Each  stratum  was  split  into  two  parallel  substrata.  As  a  result,  for 
each  individual  there  were  18  peaked  tests.  Within  each  of  those  strata  and 
substrata  the  proportion  correct  for  each  individual  was  determined.  The  plot 
of  these  data  for  one  individual  is  an  observed  PCC.  Several  observed  PCCs  are 
shown  as  solid  lines  in  Figure  1.  These  curves  show  how  people  differ  in  terms 
of  how  they  obtain  different  proportions  correct  on  easy  items,  on  items  of  av¬ 
erage  difficulty,  and  on  difficult  items. 

Given  this  observed  data,  some  index  was  needed  of  whether  or  not  the  data 
from these students fit the model. Using the equation for the 3-parameter logistic
model, the ICC parameter estimates for the items, and a maximum likelihood
ability estimate for each student based on all 216 items, the estimated
probability of a correct response was computed for each item. To obtain a model-predicted
proportion correct for each stratum (and substratum), these estimated
probabilities were summed for each stratum (and substratum) and divided by
the number of items in the stratum (or substratum). These model-predicted values
are shown in Figure 1 as dashed lines for each individual. Their location
along the ability continuum is a function of the b values of the items and the
ability  estimate  for  the  individual;  the  slope  of  the  model-predicted  PCC  is  a 
function  of  the  item  discriminations  and  guessing  parameter  values. 
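
The computation of the model-predicted PCC just described can be sketched as follows. This is a minimal illustration, not the code used in the study: the item parameters are invented, D = 1.7 is an assumed scaling constant, and the function names are hypothetical. Only the design (9 strata of 24 items, five-option items) follows the text.

```python
import math

D = 1.7  # conventional scaling constant (an assumption)

def p3pl(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def predicted_pcc(theta_hat, strata):
    """Model-predicted proportion correct per stratum.

    For each stratum, sum the estimated response probabilities at the
    person's ability estimate and divide by the number of items.
    """
    return [sum(p3pl(theta_hat, *item) for item in stratum) / len(stratum)
            for stratum in strata]

# Hypothetical pool: 9 difficulty strata of 24 items each, c = 0.2.
centers = [-2.0 + 0.5 * k for k in range(9)]
strata = [[(1.0, center + 0.01 * (j - 11.5), 0.2) for j in range(24)]
          for center in centers]

curve = predicted_pcc(0.0, strata)
for center, proportion in zip(centers, curve):
    print(f"stratum centered at b = {center:4.1f}: predicted p = {proportion:.3f}")
```

Plotting these nine values against stratum difficulty gives the dashed curve for one person; the observed PCC is obtained in the same way from the person's actual proportion correct in each stratum.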

The  fit  of  each  person's  observed  PCC  data  to  the  model-predicted  data  was 
determined  by  a  chi-square  test.  Results  showed  that  about  90%  of  the  students 
did not deviate significantly from the model at the 5% level. It was thus encouraging
to see that the responses of most of the students fit the model; Figures
1a and 1b illustrate PCC data for two students whose responses did fit the
model. To determine if any of that 10% group reliably did not fit the model,
PCCs for each person were determined separately for each of the parallel substrata,
and for each person two indices of fit were computed. Chi-square values
for the first set of substrata were plotted against those for the second set,
and individuals whose data were significant at the 5% level for both chi-squares
were identified. This analysis identified a small group of students who reliably
did not fit the model. The observed and expected PCCs for two members of
this group are shown in Figures 1c and 1d. These figures show two different
patterns of non-fit to the 3-parameter model. But the major conclusion was that
the vast majority of the students did perform in accordance with the 3-parameter
model (a complete report of this study is in Trabin & Weiss, 1979).

[Figure 1: observed (solid) and model-predicted (dashed) PCCs for four students, panels a through d; vertical axes: Proportion Correct (p)]
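
A person-fit index of the kind described above might be sketched as follows. The exact statistic and degrees of freedom used in the study are not given here, so this sketch assumes a Pearson-type chi-square over strata and, purely for illustration, the 5% critical value for 9 degrees of freedom. All data values and names are hypothetical.

```python
def chi_square_fit(observed, expected, n_items):
    """Pearson-type fit statistic over strata (an assumed form).

    Sums n * (p_obs - p_exp)^2 / (p_exp * (1 - p_exp)) across strata,
    where n is the number of items in the stratum.
    """
    return sum(n * (o - e) ** 2 / (e * (1.0 - e))
               for o, e, n in zip(observed, expected, n_items))

def reliably_misfits(chi_first, chi_second, critical):
    """Flag a person only if both parallel substrata sets show significant misfit."""
    return chi_first > critical and chi_second > critical

CRITICAL = 16.919  # chi-square, 9 df, 5% level (the df is an assumption)

# Hypothetical proportions for 9 substrata of 12 items each.
n_items = [12] * 9
expected = [0.95, 0.90, 0.82, 0.70, 0.58, 0.45, 0.35, 0.28, 0.23]
fitting = [0.92, 0.92, 0.83, 0.67, 0.58, 0.42, 0.33, 0.25, 0.25]
aberrant = [0.58, 0.67, 0.58, 0.67, 0.58, 0.58, 0.50, 0.58, 0.50]

chi_fit = chi_square_fit(fitting, expected, n_items)
chi_odd = chi_square_fit(aberrant, expected, n_items)
print(f"fitting pattern:  chi-square = {chi_fit:.2f}")
print(f"aberrant pattern: chi-square = {chi_odd:.2f}")
```

Requiring significance on both parallel halves, as in `reliably_misfits`, guards against flagging a person whose misfit on one half is a chance result.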

Another  theme  that  was  prevalent  at  this  conference  was  the  question  of 
whether adaptive testing should be used at all. Lumsden said it should not;
Lord said it should not; Fischer said it should not; I say it should. However,
we should carefully evaluate the question of fixed length versus variable length
adaptive tests. Although several psychometricians supported fixed length adaptive
tests, I believe that variable length tests are more appropriate than fixed
length tests. This belief is based on data from the Bayesian posterior variances
or the estimated standard errors of measurement for individuals taking an
adaptive  test;  at  any  given  item  length  there  are  individual  differences  in 
those  error  estimates.  Some  individuals  are  more  precisely  measured  at  a  given 
number  of  items  than  others;  and  this  is  a  function  of  the  individuals  taking 
tests,  not  a  function  of  the  item  parameters  themselves.  It  is  also  a  function 
of  the  specific  items  that  those  individuals  took.  Not  all  items  in  any  real 
item  pool,  regardless  of  how  ideal  it  is,  will  be  equally  distant  from  the  abil¬ 
ity  level  of  every  person.  Consequently,  as  long  as  item  parameters  differ  in 
the  pool,  if  items  are  selected  to  maximize  some  function  for  an  individual,  any 
two  individuals  will  obtain  different  errors  of  estimate/measurement.  When  that 
happens,  variable  length  adaptive  tests  are  more  appropriate  than  fixed  length 
adaptive  tests.  Testing  should  therefore  continue  until  the  level  of  precision 
desired  is  obtained.  Test  length  will  then  vary  based  on  how  each  particular 
individual  happens  to  interact  with  that  particular  subset  of  items;  that  inter¬ 
action  may  include  personality  characteristics,  such  as  risk-taking,  that  affect 
test  scores  but  are  not  on  the  same  dimension  that  is  being  measured  with  a  par¬ 
ticular  subset  of  items. 
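
The variable-length rule argued for above (continue testing until the desired precision is reached) can be sketched as follows. This is a minimal illustration under stated assumptions, not any operational procedure: ability is estimated by a posterior mean on a grid with a standard normal prior, the next item is the one with maximum information at the current estimate, and the item pool, the target posterior standard deviation, the maximum length, and the deterministic "examinee" are all hypothetical.

```python
import math

D = 1.7  # conventional scaling constant (an assumption)
GRID = [-4.0 + 0.1 * k for k in range(81)]  # ability grid for numerical integration

def p3pl(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_info(theta, a, b, c):
    """Fisher information of one 3PL item at ability theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def posterior_mean_sd(responses):
    """Posterior mean and SD of ability on GRID under a standard normal prior."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for item, correct in responses:
            p = p3pl(t, *item)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def variable_length_cat(pool, answer, target_sd=0.35, max_items=30):
    """Administer items until the posterior SD reaches target_sd (or a limit is hit)."""
    remaining = list(pool)
    responses = []
    estimate, sd = posterior_mean_sd(responses)
    while remaining and len(responses) < max_items and sd > target_sd:
        item = max(remaining, key=lambda it: item_info(estimate, *it))
        remaining.remove(item)
        responses.append((item, answer(item)))
        estimate, sd = posterior_mean_sd(responses)
    return estimate, sd, len(responses)

def threshold_responder(level):
    """Deterministic stand-in for an examinee: correct on items no harder than level."""
    return lambda item: 1 if item[1] <= level else 0

pool = [(1.0, -3.0 + 0.1 * k, 0.2) for k in range(61)]  # hypothetical item pool
for level in (-1.5, 0.5):
    est, sd, n = variable_length_cat(pool, threshold_responder(level))
    print(f"level {level:+.1f}: estimate {est:+.2f}, posterior SD {sd:.2f}, items used {n}")
```

The number of items administered is an outcome of the procedure rather than a fixed design constant, so it can differ from one examinee to the next.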

A  third  problem  that  I  have  observed  throughout  this  conference,  mentioned 
earlier  by  Lord,  which  has  still  not  been  solved,  is  the  scoring  problem  for 
latent-trait-based  procedures.  The  Bayesian  scoring  procedure  that  is  now  popu¬ 
lar  has  the  problem  of  regressing  ability  estimates  toward  the  mean.  This  means 
that  there  are  some  individuals  whose  true  ability  levels  are  two  or  three  stan¬ 
dard  deviations  away  from  the  mean  but  whose  ability  estimates  will  be  less  ex¬ 
treme.  The  result  is  less  discrimination  among  those  whose  ability  is  very  high 
or  very  low  using  the  Bayesian  procedure.  This  problem  needs  to  be  resolved. 

One  solution  would  be  a  distribution-free  Bayesian  scoring  procedure  that 
is  not  a  maximum  likelihood  procedure,  since  the  maximum  likelihood  procedure 
has  the  problem  of  an  inability  to  provide  ability  estimates  for  individuals 
with  unusual  response  patterns  (and  real  people  do  get  unusual  response  pat¬ 
terns)  or  for  individuals  who  answer  all  the  items  correctly  or  incorrectly. 

One  ad  hoc  solution  to  the  problem  with  the  maximum  likelihood  estimates  of 
ability  is  simply  to  look  at  the  data  by  plotting  the  likelihood  function  for  a 
response  pattern  that  does  not  converge.  This  may  help  in  uncovering  the  cause 
for  the  lack  of  convergence  and  will  show  where  the  likelihood  function  begins 
to flatten. This value may then be utilized as a provisional estimate of ability.
This may be better for selecting subsequent test items than assigning an
arbitrary  -10,  -40,  or  -5  as  an  ability  estimate. 
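
The ad hoc inspection suggested above can be sketched numerically. This is a minimal illustration for an all-incorrect response pattern, for which the maximum likelihood estimate does not converge. The item parameters, the grid, and the slope tolerance used to decide where the function "begins to flatten" are assumptions, as are the function names.

```python
import math

D = 1.7  # conventional scaling constant (an assumption)

def p3pl(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, responses):
    """Log-likelihood of a response pattern at ability theta."""
    total = 0.0
    for (a, b, c), correct in responses:
        p = p3pl(theta, a, b, c)
        total += math.log(p) if correct else math.log(1.0 - p)
    return total

def flattening_point(responses, lo=-10.0, hi=4.0, step=0.1, tol=0.01):
    """Scan downward from hi; return the first theta where the slope is below tol.

    The slope is a finite difference of the log-likelihood over one grid step.
    Returns None if the function never flattens within [lo, hi].
    """
    n_steps = int(round((hi - lo) / step))
    for k in range(n_steps):
        upper = hi - k * step
        lower = upper - step
        slope = (log_likelihood(upper, responses)
                 - log_likelihood(lower, responses)) / step
        if abs(slope) < tol:
            return upper
    return None

# Hypothetical 9-item test, every item answered incorrectly.
items = [(1.0, -2.0 + 0.5 * k, 0.2) for k in range(9)]
all_wrong = [(item, 0) for item in items]

point = flattening_point(all_wrong)
print("provisional ability estimate:", point)
```

The value returned depends on the tolerance chosen, so it is a provisional working estimate for item selection rather than a final score.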

The c parameter in ICC theory is a problem in estimation, since it creates
problems in the estimation of the a parameter and lowers estimated item discriminations.
Guessing also introduces into test scores many variables that are
inappropriate. Thus, I can only support Lumsden's suggestion that the multiple-choice
item be retired and that new item types be developed that are free of
the  technology  under  which  testing  developed  70  years  ago.  The  new  test  item 
need  not  necessarily  be  completely  free  response.  There  are  other  kinds  of 
items  that  will  do  a  good  job  of  measuring  that  are  not  necessarily  free  re¬ 
sponse  items.  Although  free-response  (or  completion)  items  are  obviously  the 
ideal  toward  which  we  should  strive,  we  should  carefully  examine  our  test  items 
to  determine  whether  a  non-multiple-choice  format  can  be  used  so  as  to  eliminate 
the  guessing  problem  and  thereby  do  a  better  job  in  item  parameter  estimation 
and  individual  measurement. 

At  the  same  time  we  need  new  kinds  of  tests.  Given  the  capabilities  of  the 
computer,  now  that  we  have  it,  we  need  to  develop  new  kinds  of  tests  that  may 
not  be  based  on  latent  trait  theory  but  that  more  fully  utilize  the  capability 
of  the  computer  to  interact  with  an  individual  in  order  to  measure  abilities 
that we are not now measuring. I hope that when we do develop these kinds of
tests, we avoid the multiple-choice item and try to be more creative
and develop testing situations that will more truly reflect the potential actual
performance of people in the real world and the criteria that we are attempting
to  predict. 

Now  that  we  are  using  computers  for  test  administration,  I  see  a  danger  in 
the  use  of  response  latency  information  without  carefully  examining  its  charac¬ 
teristics.  It  is  very  easy  now  to  collect  response  latency  data  on  an  individu¬ 
al  taking  a  test  item  and  to  use  those  data  in  ways  that  experimental  psycholo¬ 
gists  have  done  for  many  years.  But  there  is  a  critical  difference  between  what 
the  experimental  psychologist  does  and  what  the  psychometrician  does.  The  dif¬ 
ference  is  that  when  experimental  psychologists  use  response  latency  data,  they 
typically  take  numerous  observations  and  then  compute  mean  response  latencies 
for  individuals — their  means  are  computed  either  across  individuals  and/or  over 
replications  of  stimuli — and  those  means  average  out  many  random  fluctuations 
that  occur  in  real  data. 

When  psychometricians  look  at  latencies  for  individual  test  items  and  build 
models  about  response  latency  for  people  taking  individual  ability  test  items, 
they  might  build  those  models  on  much  irrelevant  data.  Before  such  models  are 
built,  the  psychometrician  should  observe  people  taking  an  ability  test.  What 
will  be  observed  as  components  of  response  latencies  are  people  scratching  their 
heads,  fixing  their  contact  lenses,  observing  others  walking  to  and  from  their 
testing  terminals,  or  just  plain  inattention  and  daydreaming,  rather  than  re¬ 
sponding  instantaneously  as  soon  as  they  have  arrived  at  the  correct  answer,  as 
the  models  will  posit.  As  a  result,  latencies  measured  at  the  individual  item 
level  will  include  many  random  components.  Elimination  of  these  kinds  of  dis¬ 
turbances  will  require  many  replications  of  items  with  similar  difficulties; 
then we might obtain a valid estimate of the response latency for a person on an
item subset of a given difficulty. Thus, before we attempt to use response latency
data in the measurement process, the reliability and validity of response
latency  data  derived  from  ability  testing  situations  need  to  be  examined. 

An  additional  problem  in  adaptive  testing  that  needs  further  research  is 
the dimensionality problem. All of latent trait theory that has been studied
and applied to date is based on the unidimensional case; we still have not adequately
solved the multidimensional case. If latent trait theory is to be adequately
used in many practical testing situations, the multidimensional case
will need to be operationalized, since tests cannot always be made as unidimensional
as we would like to have them.

Finally,  we  should  not  rely  totally  on  ICC  theory  for  adaptive  testing. 
There  are  ways  to  implement  adaptive  testing  that  do  not  require  ICC  theory 
(e.g.,  Weiss,  1975).  ICC  theory  will  be  useful  if  there  are  1,000  subjects  and 
80  items  (or  whatever  future  research  discovers  to  be  adequate)  on  which  to  pa¬ 
rameterize  test  items.  But  there  are  many  environments  where  such  item  pools 
and  sample  sizes  are  not  available.  In  these  cases  other  ways  of  doing  adaptive 
testing,  which  might  operate  more  effectively  than  ICC  methods  (e.g.,  Thompson  & 
Weiss,  1980),  should  be  considered. 


John B. Carroll
University  of  North  Carolina 
at  Chapel  Hill 


I  should  like  to  make  some  comments  about  the  person  characteristic  curve 
(PCC),  which  has  just  been  discussed  by  Weiss  (1980).  Before  moving  into  the 
field  of  cognitive  psychology,  I  was  a  test  theorist;  and  one  of  my  concerns, 
although not under that name, was actually the PCC. My original interest stemmed
from a paper by Guilford (1941) in which he claimed that a factor analysis
of  the  10  subtests  of  the  Seashore  Test  of  Pitch  Discrimination  revealed  that 
more  than  one  ability  would  be  involved  in  performance  on  this  test.  Indeed,  he 
believed  that  three  factors  were  involved — one  for  easy  items,  one  for  items  of 
medium  difficulty,  and  one  for  difficult  items. 

This  conclusion  made  absolutely  no  sense:  I  found  it  difficult  to  believe 
that an individual who could not make an easy pitch discrimination could nevertheless
detect a very small pitch difference. I developed the statistical rationale
(Carroll, 1945) whereby I was able to convince myself that Guilford's
findings were an artifact resulting from the use of tetrachoric correlations
with the scores affected by chance success, a conclusion that Gourlay (1951)
confirmed  and  that  I  have  discussed  (see  Carroll,  1961). 

Underlying this rationale was the notion that the response curve of an individual
to items of varying difficulty measuring a single trait was a psychometric
function, for example, a normal ogive starting at probability asymptotic
to unity for easy items and descending to near zero, or at least to a chance
level c, for more difficult items. Actually, this is simply a version of a
standard  psychophysical  function.  It  is  well  illustrated  with  data  from  the 
Seashore  Test  of  Pitch  Discrimination,  which  contains  10  subtests,  each  with  10 
two-choice  items  at  a  particular  level  of  difficulty  in  terms  of  the  difference 
(in  Hertz)  between  the  two  pitches  presented  for  a  judgment  as  to  whether  the 
second  pitch  is  higher  or  lower  than  the  first.  Because  of  unreliability  and 
chance  success  factors,  the  response  curve  for  any  single  individual  will  be 
rather  irregular;  but  mean  response  curves  for  individuals  at  different  total 
test  score  intervals  will  exhibit  the  form  illustrated  in  Figure  1.  In  effect, 
these  are  mean  PCCs  and  they  are  similar  to  Weiss's  (1980)  illustration  for  a 
vocabulary  test. 
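
The form of response curve described above can be sketched as follows. This is a minimal illustration, assuming a normal ogive in the difference between the person's ability and the item difficulty, with a floor at the chance level c. The parameter names, the `spread` parameter, and the numerical values are hypothetical; c = 0.5 reflects the two-choice format of the pitch items.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def person_response_curve(difficulties, ability, spread=1.0, chance=0.5):
    """Probability of success at each difficulty level for one person.

    The curve approaches 1 for easy items and descends toward the chance
    level for difficult ones; a smaller spread gives a steeper curve.
    """
    return [chance + (1.0 - chance) * normal_cdf((ability - d) / spread)
            for d in difficulties]

# Ten hypothetical difficulty levels, standing in for ten subtests.
levels = [-3.0 + 6.0 * k / 9 for k in range(10)]
low_ability = person_response_curve(levels, ability=-1.0)
high_ability = person_response_curve(levels, ability=1.0)

for d, lo, hi in zip(levels, low_ability, high_ability):
    print(f"difficulty {d:+.2f}: low-ability p = {lo:.3f}, high-ability p = {hi:.3f}")
```

In this sketch the `spread` parameter plays the role of the slope discussed in the text: it is set per test rather than per person, in line with the view that slope says something about the trait being measured.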

I,  too,  have  plotted  such  curves  for  vocabulary  tests,  as  well  as  for 
achievement  test  items,  as  in  a  study  of  Navy  officer  candidate  examinations 
(Carroll  &  Schohan,  1953),  where  they  were  called  individual  operating  charac¬ 
teristic  curves.  I  have  tended  to  think  of  the  slopes  of  these  curves  as  indi¬ 
cating  something  about  the  trait  being  measured,  rather  than  the  individual. 

Figure 1

Expected Mean Person Response Characteristic Curves
for (A) Low-Ability Examinees and (B) High-Ability Examinees
on a Test of a Trait such as Pitch Discrimination Ability

With a psychophysical function such as that of pitch discrimination, the slopes
will be relatively steep; but with achievement tests, the slopes will be relatively
less steep. In fact, in the case of the Navy officer candidate examinations
(Carroll & Schohan, 1953), the slopes were so low as to indicate that the
tests  were  "perfectly  heterogeneous  tests";  that  is,  they  were  the  slopes  that 
would  be  expected  for  tests  composed  of  items  differing  in  difficulty  but  with 
population  intercorrelations  (corrected  for  chance  success  effects)  equal  to 
zero.  Even  though  my  other  interests  and  commitments  have  never  permitted  me  to 
develop  this  kind  of  test  theory  as  much  as  I  would  have  wished,  I  recommend 
that  this  line  of  thinking  be  further  explored,  particularly  in  the  light  of 
latent  trait  theory.  (An  exposition  and  application  of  my  theory,  as  far  as  I 
carried  it,  is  to  be  found  in  a  doctoral  thesis  by  Dry,  1959.) 

One  interesting  point  emerges.  Contrary  to  some  opinions  that  have  been 
expressed here (opinions that can be respected in view of the reasons given for
them), I am going to be very heretical and suggest that rather than "getting rid
of" multiple-choice items, we feature them in our work but make them two-choice
instead of "multiple" choice. This is essentially what many experimental cognitive
psychologists have been doing: converting multiple-choice tests to a two-choice
format in order to capitalize on certain advantages of this format. For
example, Egan (1976) converted several of Guilford's spatial ability tests to a
two-choice format, primarily in order to obtain more valid and reliable response
latency  measurements.  Giving  the  respondent  a  two-choice  option  (a  true-false 
or  a  yes-no  option)  obviates  the  problem  of  time  wasted  in  scanning,  comparing, 
and  evaluating  a  large  number  of  choices.  This  is  one  advantage  of  the  two- 
choice  format.  Another  advantage,  from  the  standpoint  of  latent  trait  theory, 
is  that  the  c  parameter  can  be  determined,  in  many  circumstances,  by  a  priori 
considerations  as  equal  to  .5,  provided  that  the  examinee  is  led  to  believe  that 
the  probability  of  a  particular  response  being  correct  is  .5.  This  can  be  done, 
of  course,  by  insuring  that  equal  numbers  of  true-false  (or  yes-no)  items  are 
present  in  the  test  or  the  experimental  series. 

Note that experimental psychologists are not usually interested in the subject's
latency or correctness on a single item; they take measurements over
groups of similar items or replicate the data over multiple trials or trial
blocks. A similar approach can be taken in the case of ability or achievement
testing without increasing testing time much, if at all. Actually, constructing
large numbers of two-choice items is easier than constructing large numbers of
five-choice items. However, one should avoid making items that deliberately
mislead low-ability examinees into making incorrect responses, for in this case
the c parameter can easily drop well below .5.

It  has  been  my  intention  in  this  discussion  to  mention  some  possibilities 
that  might  well  be  followed  up  in  future  work  on  the  applications  of  latent 
trait  theory  to  computerized  testing;  I  will  be  interested  in  watching  any  such 
developments. 